Unit 1 - Notes

INT395

Unit 1: Introduction and Data Preprocessing

1. Overview of Supervised Learning

Definition

Supervised learning is a paradigm of machine learning where the model is trained on a labeled dataset. The algorithm learns a mapping function from input variables (X, also called features) to output variables (Y, also called labels or targets).

  • Goal: To approximate the mapping function so well that when you have new input data (X), you can predict the output variables (Y) for that data.
  • The Teacher Analogy: The process is supervised because we know the correct answers. The algorithm makes predictions on the training data and is corrected by the "teacher" (the loss function) whenever the predictions are wrong.

Key Use Cases

  1. Email Filtering: Classifying emails as "Spam" or "Not Spam."
  2. Credit Scoring: Predicting if a customer will default on a loan based on credit history.
  3. Real Estate: Predicting house prices based on square footage, location, and number of bedrooms.
  4. Medical Diagnosis: Predicting disease presence (Positive/Negative) based on symptoms and test results.
  5. Image Recognition: Identifying objects (e.g., cats vs. dogs) within an image.

2. Types of Supervised Learning

Supervised learning problems are grouped based on the nature of the target variable (Y).

A. Classification

The output variable is a category (discrete value).

  • Binary Classification: Two possible classes (e.g., Yes/No, Churn/Retain).
  • Multi-class Classification: More than two classes (e.g., Categorizing news articles into Sports, Politics, Tech).
  • Algorithms: Logistic Regression, Decision Trees, Support Vector Machines (SVM), k-Nearest Neighbors (KNN), Random Forest.
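
A minimal sketch of the classification workflow with one of these algorithms (the toy data below is invented purely for illustration):

PYTHON
from sklearn.linear_model import LogisticRegression

# Hypothetical toy data: two features per sample, binary labels
X = [[0.5, 1.2], [1.0, 0.8], [3.2, 2.9], [2.8, 3.5]]
y = [0, 0, 1, 1]

clf = LogisticRegression()
clf.fit(X, y)                     # learn the mapping from features to labels
print(clf.predict([[1.1, 1.0]]))  # predict the class of a new sample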

B. Regression

The output variable is a real value (continuous value).

  • Examples: Predicting temperature, stock prices, age, or sales revenue.
  • Algorithms: Linear Regression, Polynomial Regression, Support Vector Regression (SVR), Decision Tree Regression.
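
A comparable sketch for regression, where the prediction is a continuous number (toy values invented for illustration):

PYTHON
from sklearn.linear_model import LinearRegression

# Hypothetical data: years of experience -> salary (in thousands)
X = [[1], [2], [3], [4], [5]]
y = [30, 35, 42, 48, 55]

reg = LinearRegression()
reg.fit(X, y)
print(reg.predict([[6]]))  # estimate salary for 6 years of experience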

3. Setting up Python and Scikit-learn Environment

To perform supervised learning, the Python ecosystem relies on a stack of scientific libraries.

Essential Libraries

  1. NumPy: Fundamental package for scientific computing (arrays, matrices).
  2. Pandas: Data manipulation and analysis (DataFrames).
  3. Matplotlib/Seaborn: Data visualization.
  4. Scikit-learn (sklearn): The core machine learning library containing algorithms and preprocessing tools.

Installation

BASH
pip install numpy pandas matplotlib seaborn scikit-learn jupyter

Basic Import Convention

PYTHON
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
# Specific models are imported as needed, e.g.:
# from sklearn.linear_model import LinearRegression


4. Types of Data

Understanding the nature of features is critical for preprocessing.

  1. Numerical (Quantitative):
    • Continuous: Can take any value within a range (e.g., Height, Weight, Price).
    • Discrete: Countable integers (e.g., Number of children, Number of purchases).
  2. Categorical (Qualitative):
    • Nominal: Categories with no intrinsic ordering (e.g., Color: Red/Blue/Green, Gender: M/F).
    • Ordinal: Categories with a clear order (e.g., T-shirt size: S/M/L/XL, Rating: Low/Medium/High).
  3. Time-Series: Data points indexed in time order (e.g., Daily stock closing prices).
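
In pandas, these types show up as column dtypes; a small sketch with hypothetical columns:

PYTHON
import pandas as pd

df = pd.DataFrame({
    'price': [10.5, 20.0, 15.75],        # numerical, continuous
    'num_purchases': [1, 3, 2],          # numerical, discrete
    'color': ['Red', 'Blue', 'Green'],   # categorical, nominal
    'size': pd.Categorical(['S', 'M', 'L'], categories=['S', 'M', 'L'], ordered=True),  # categorical, ordinal
    'date': pd.to_datetime(['2024-01-01', '2024-01-02', '2024-01-03']),  # time-indexed
})
print(df.dtypes)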

5. Loading Datasets

Data is usually stored in CSV, Excel, or SQL databases.

Using Pandas

PYTHON
# Loading CSV
df = pd.read_csv('dataset.csv')

# Loading Excel
df = pd.read_excel('dataset.xlsx')
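
SQL databases can be read through pandas as well; a sketch assuming a hypothetical SQLite file and table name:

PYTHON
import sqlite3

# Loading from a SQL database (file and table names are placeholders)
conn = sqlite3.connect('dataset.db')
df = pd.read_sql('SELECT * FROM customers', conn)
conn.close()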

Using Scikit-learn (Toy Datasets)

Useful for practice and prototyping.

PYTHON
from sklearn.datasets import load_iris

# Load Iris dataset (Classification)
data = load_iris()
df = pd.DataFrame(data.data, columns=data.feature_names)
df['target'] = data.target


6. Data Exploration (EDA) with Pandas and Visualization

Exploratory Data Analysis (EDA) allows us to understand the data structure, patterns, and anomalies before modeling.

Pandas Inspection Methods

  • df.head(): View the first 5 rows.
  • df.shape: Dimensions of the dataset (rows, columns).
  • df.info(): Summarizes data types and non-null counts per column.
  • df.describe(): Statistical summary (mean, std, min, max, quartiles) for numerical features.
  • df['column'].value_counts(): Frequency of unique values (for categorical data).

Visualization (Matplotlib & Seaborn)

  1. Histograms: Check distribution of numerical data (Normal vs. Skewed).
    PYTHON
        sns.histplot(df['age'], kde=True)
        
  2. Box Plots: Identify outliers and quartiles.
    PYTHON
        sns.boxplot(x=df['salary'])
        
  3. Scatter Plots: Analyze relationships between two numerical variables.
    PYTHON
        plt.scatter(df['experience'], df['salary'])
        
  4. Heatmaps: Visualize correlation matrices to find collinearity.
    PYTHON
        sns.heatmap(df.corr(numeric_only=True), annot=True, cmap='coolwarm')
        

7. Common Data Issues and Cleaning

Real-world data is "dirty." Models trained on bad data produce bad predictions (Garbage In, Garbage Out).

Common Issues

  1. Noise: Random error or variance in a measured variable.
  2. Inconsistency: Formatting differences (e.g., "NY", "N.Y.", "New York"); see the cleanup sketch after this list.
  3. Duplicates: Identical rows that bias the model.
    PYTHON
        df.drop_duplicates(inplace=True)
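
A sketch of fixing the formatting inconsistency from item 2, using a hypothetical 'state' column:

PYTHON
# Map inconsistent spellings to one canonical value
df['state'] = df['state'].replace({'N.Y.': 'NY', 'New York': 'NY'})
# Stripping whitespace and unifying case also removes many inconsistencies
df['state'] = df['state'].str.strip().str.upper()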
        

Handling Missing Values

Missing data is represented as NaN (Not a Number) or None.

Detection: df.isnull().sum()

Strategies:

  1. Dropping:
    • Remove rows (df.dropna()) if the dataset is large and missing rows are few.
    • Remove columns if >50% of the data is missing.
  2. Imputation (Filling):
    • Mean: Use for normally distributed numerical data.
    • Median: Use for skewed numerical data (robust to outliers).
    • Mode: Use for categorical data.
    • Model-based: Use KNN or Regression to predict missing values.

PYTHON
from sklearn.impute import SimpleImputer
imputer = SimpleImputer(strategy='mean')
# fit_transform returns a 2-D array, so assign back through a column list
df[['age']] = imputer.fit_transform(df[['age']])
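
For the other strategies, SimpleImputer also accepts strategy='median' or strategy='most_frequent'; the model-based option can be sketched with KNNImputer (column names are hypothetical):

PYTHON
from sklearn.impute import KNNImputer

# Fill each missing value using the k most similar rows (numerical columns only)
knn_imputer = KNNImputer(n_neighbors=5)
df[['age', 'salary']] = knn_imputer.fit_transform(df[['age', 'salary']])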

Handling Outliers

Outliers are extreme values that deviate significantly from other observations.

  1. Detection:
    • Z-Score: Values more than 3 standard deviations from the mean (|z| > 3) are typically flagged.
    • IQR (Interquartile Range): Values below Q1 - 1.5*IQR or above Q3 + 1.5*IQR.
  2. Treatment:
    • Trimming: Remove the outlier rows.
    • Capping (Winsorizing): Set outliers to the 5th or 95th percentile value.
    • Log Transformation: Compresses the range of high-magnitude values.
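
A sketch of IQR-based detection followed by capping, for a hypothetical 'salary' column:

PYTHON
# IQR fences
q1 = df['salary'].quantile(0.25)
q3 = df['salary'].quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Detection: boolean mask marking outlier rows
outliers = (df['salary'] < lower) | (df['salary'] > upper)

# Treatment by capping: clip extreme values to the fences
df['salary'] = df['salary'].clip(lower=lower, upper=upper)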

8. Handling Class Imbalance

Occurs in classification when one class significantly outnumbers the other (e.g., Fraud Detection: 99% legitimate, 1% fraud). Models become biased toward the majority class.

Techniques

  1. Resampling:
    • Undersampling: Randomly remove samples from the majority class (loss of information).
    • Oversampling: Randomly duplicate samples from the minority class.
  2. SMOTE (Synthetic Minority Over-sampling Technique): Creates synthetic (artificial) data points for the minority class by interpolating between existing samples.
  3. Class Weights: Modify the algorithm to penalize errors on the minority class more heavily (e.g., class_weight='balanced' in Scikit-learn).
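
Class weights are a one-parameter change in Scikit-learn; SMOTE comes from the separate imbalanced-learn package. A sketch, assuming X_train and y_train from a prior train/test split:

PYTHON
from sklearn.linear_model import LogisticRegression

# Penalize mistakes on the minority class more heavily
clf = LogisticRegression(class_weight='balanced')
clf.fit(X_train, y_train)

# SMOTE (requires: pip install imbalanced-learn)
# from imblearn.over_sampling import SMOTE
# X_resampled, y_resampled = SMOTE(random_state=42).fit_resample(X_train, y_train)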

9. Feature Scaling

Scaling ensures all features contribute equally to the result. It is crucial for distance-based algorithms (KNN, SVM, K-Means) and Gradient Descent optimization. Tree-based models (Random Forest) generally do not require scaling.

Techniques

  1. Standardization (Z-score Normalization):

    • Scales features to have Mean (μ) = 0 and Standard Deviation (σ) = 1.
    • Preferred when data follows a Gaussian distribution or has outliers.
    • Formula: z = (x - μ) / σ
      PYTHON
          from sklearn.preprocessing import StandardScaler
          scaler = StandardScaler()
          
  2. Normalization (Min-Max Scaling):

    • Scales features to a range [0, 1].
    • Preferred for neural networks or image processing.
    • Formula: x_scaled = (x - x_min) / (x_max - x_min)
      PYTHON
          from sklearn.preprocessing import MinMaxScaler
          scaler = MinMaxScaler()
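
In either case, the scaler should be fit on the training split only and the learned parameters reused on the test split (a sketch, assuming X_train and X_test already exist):

PYTHON
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # learn mean/std from training data
X_test_scaled = scaler.transform(X_test)        # apply the same parameters to test data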
          

10. Encoding Categorical Variables

Machine learning models require numerical input. Categorical text data must be converted.

1. Label Encoding

Assigns a unique integer to each category (e.g., Low=0, Medium=1, High=2).

  • Use case: Ordinal data (where order matters).
  • Library: sklearn.preprocessing.LabelEncoder

2. One-Hot Encoding

Creates a new binary column for each category.

  • Use case: Nominal data (where no order exists, e.g., Cities).
  • Issue: Can increase dimensionality significantly.
  • Library: pd.get_dummies() or sklearn.preprocessing.OneHotEncoder.
  • Dummy Variable Trap: Multicollinearity caused by including all dummy variables. Solution: Drop one column (drop_first=True).
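
A sketch of both encoders on hypothetical columns:

PYTHON
from sklearn.preprocessing import LabelEncoder

# Label encoding for an ordinal column
# (note: LabelEncoder assigns integers alphabetically; a manual mapping
# such as {'Low': 0, 'Medium': 1, 'High': 2} preserves the intended order)
le = LabelEncoder()
df['size_encoded'] = le.fit_transform(df['size'])

# One-hot encoding for a nominal column, dropping one dummy to avoid the trap
df = pd.get_dummies(df, columns=['city'], drop_first=True)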

11. Feature Engineering

The art of creating new features from existing ones to improve model performance.

Examples:

  • Binning: Converting continuous age into groups (Child, Teen, Adult, Senior).
  • Polynomial Features: Creating interaction and power terms (e.g., x1*x2, x1^2) to capture non-linear relationships.
  • Date/Time Extraction: Extracting 'Day of Week', 'Month', or 'Hour' from a timestamp.
  • Domain Specific: Creating a "Debt-to-Income Ratio" from "Total Debt" and "Income" columns.
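
Sketches of these transformations on hypothetical columns:

PYTHON
# Binning a continuous age into groups
df['age_group'] = pd.cut(df['age'], bins=[0, 12, 19, 64, 120],
                         labels=['Child', 'Teen', 'Adult', 'Senior'])

# Extracting parts of a timestamp
df['timestamp'] = pd.to_datetime(df['timestamp'])
df['day_of_week'] = df['timestamp'].dt.dayofweek
df['month'] = df['timestamp'].dt.month

# Domain-specific ratio feature
df['debt_to_income'] = df['total_debt'] / df['income']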

12. Feature Selection

Selecting the most relevant subset of features to reduce overfitting, improve accuracy, and speed up training.

Methods

  1. Filter Methods: Use statistical tests to select features independent of the model.
    • Correlation Matrix: Remove features highly correlated with each other (redundant).
    • Chi-Square Test: For categorical features.
  2. Wrapper Methods: Evaluate specific subsets of features by training a model.
    • RFE (Recursive Feature Elimination): Recursively removes the least important features.
  3. Embedded Methods: Feature selection occurs during model training.
    • Lasso Regression (L1 Regularization): Shrinks coefficients of less important features to zero.
    • Tree-based Importance: Random Forests provide a feature_importances_ attribute.
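
A sketch of a wrapper and an embedded method (assuming X is a numeric feature DataFrame and y the target; the number of features kept is arbitrary here):

PYTHON
from sklearn.feature_selection import RFE
from sklearn.ensemble import RandomForestClassifier

# Wrapper: recursively drop the least important features
rfe = RFE(estimator=RandomForestClassifier(random_state=42), n_features_to_select=5)
X_selected = rfe.fit_transform(X, y)

# Embedded: inspect tree-based feature importances
rf = RandomForestClassifier(random_state=42).fit(X, y)
print(sorted(zip(rf.feature_importances_, X.columns), reverse=True))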

13. Data Splitting

We must separate data to evaluate how the model performs on unseen data.

  • Training Set: Used to train the model (usually 70-80%).
  • Testing Set: Used to evaluate performance (usually 20-30%).
  • Validation Set: (Optional) Used for hyperparameter tuning during training.

PYTHON
from sklearn.model_selection import train_test_split

X = df.drop('target', axis=1)
y = df['target']

# random_state ensures reproducibility
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)


14. Dimensionality Reduction

Reducing the number of input variables while retaining the meaningful information. Used to mitigate the Curse of Dimensionality.

Principal Component Analysis (PCA)

A linear unsupervised technique that projects data onto new orthogonal axes (Principal Components) that maximize variance.

  • Usage: Visualization (reducing to 2D/3D) or noise reduction.
  • Note: Data must be scaled (Standardized) before applying PCA.

PYTHON
from sklearn.decomposition import PCA

pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_scaled)

Linear Discriminant Analysis (LDA)

A supervised technique that finds a linear combination of features that separates two or more classes. Unlike PCA, LDA focuses on maximizing class separability.
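
A sketch mirroring the PCA example (for LDA, n_components can be at most the number of classes minus one):

PYTHON
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

lda = LinearDiscriminantAnalysis(n_components=2)
X_lda = lda.fit_transform(X_scaled, y)  # supervised: requires the class labels y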