Unit 1 - Notes

INT395

Unit 1: Introduction and Data Preprocessing

1. Overview of Supervised Learning

Definition

Supervised learning is a paradigm of machine learning where the model is trained on a labeled dataset. The algorithm learns a mapping function from input variables (X, also called features) to output variables (Y, also called labels or targets).

  • Goal: To approximate the mapping function so well that when you have new input data (X), you can predict the output variables (Y) for that data.
  • The Teacher Analogy: The process is supervised because we know the correct answers. The algorithm makes predictions on the training data and is corrected by the "teacher" (the loss function) whenever the predictions are wrong.

Key Use Cases

  1. Email Filtering: Classifying emails as "Spam" or "Not Spam."
  2. Credit Scoring: Predicting if a customer will default on a loan based on credit history.
  3. Real Estate: Predicting house prices based on square footage, location, and number of bedrooms.
  4. Medical Diagnosis: Predicting disease presence (Positive/Negative) based on symptoms and test results.
  5. Image Recognition: Identifying objects (e.g., cats vs. dogs) within an image.

2. Types of Supervised Learning

Supervised learning problems are grouped based on the nature of the target variable (Y).

A. Classification

The output variable is a category (discrete value).

  • Binary Classification: Two possible classes (e.g., Yes/No, Churn/Retain).
  • Multi-class Classification: More than two classes (e.g., Categorizing news articles into Sports, Politics, Tech).
  • Algorithms: Logistic Regression, Decision Trees, Support Vector Machines (SVM), k-Nearest Neighbors (KNN), Random Forest.
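
A minimal sketch of the classification workflow with one of these algorithms (the toy data below is invented purely for illustration):

PYTHON
from sklearn.linear_model import LogisticRegression

# Hypothetical toy data: two features per sample, binary labels
X = [[0.5, 1.2], [1.0, 0.8], [3.2, 2.9], [2.8, 3.5]]
y = [0, 0, 1, 1]

clf = LogisticRegression()
clf.fit(X, y)                     # learn the mapping from features to labels
print(clf.predict([[1.1, 1.0]]))  # predict the class of a new sample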

B. Regression

The output variable is a real value (continuous value).

  • Examples: Predicting temperature, stock prices, age, or sales revenue.
  • Algorithms: Linear Regression, Polynomial Regression, Support Vector Regression (SVR), Decision Tree Regression.
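
A comparable sketch for regression, where the prediction is a continuous number (toy values invented for illustration):

PYTHON
from sklearn.linear_model import LinearRegression

# Hypothetical data: years of experience -> salary (in thousands)
X = [[1], [2], [3], [4], [5]]
y = [30, 35, 42, 48, 55]

reg = LinearRegression()
reg.fit(X, y)
print(reg.predict([[6]]))  # estimate salary for 6 years of experience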

3. Setting up Python and Scikit-learn Environment

To perform supervised learning, the Python ecosystem relies on a stack of scientific libraries.

Essential Libraries

  1. NumPy: Fundamental package for scientific computing (arrays, matrices).
  2. Pandas: Data manipulation and analysis (DataFrames).
  3. Matplotlib/Seaborn: Data visualization.
  4. Scikit-learn (sklearn): The core machine learning library containing algorithms and preprocessing tools.

Installation

BASH
pip install numpy pandas matplotlib seaborn scikit-learn jupyter

Basic Import Convention

PYTHON
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
# Specific models are imported as needed, e.g.:
# from sklearn.linear_model import LinearRegression


4. Types of Data

Understanding the nature of features is critical for preprocessing.

  1. Numerical (Quantitative):
    • Continuous: Can take any value within a range (e.g., Height, Weight, Price).
    • Discrete: Countable integers (e.g., Number of children, Number of purchases).
  2. Categorical (Qualitative):
    • Nominal: Categories with no intrinsic ordering (e.g., Color: Red/Blue/Green, Gender: M/F).
    • Ordinal: Categories with a clear order (e.g., T-shirt size: S/M/L/XL, Rating: Low/Medium/High).
  3. Time-Series: Data points indexed in time order (e.g., Daily stock closing prices).
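
In pandas, these types show up as column dtypes; a small sketch with hypothetical columns:

PYTHON
import pandas as pd

df = pd.DataFrame({
    'price': [10.5, 20.0, 15.75],        # numerical, continuous
    'num_purchases': [1, 3, 2],          # numerical, discrete
    'color': ['Red', 'Blue', 'Green'],   # categorical, nominal
    'size': pd.Categorical(['S', 'M', 'L'], categories=['S', 'M', 'L'], ordered=True),  # categorical, ordinal
    'date': pd.to_datetime(['2024-01-01', '2024-01-02', '2024-01-03']),  # time-indexed
})
print(df.dtypes)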

5. Loading Datasets

Data is usually stored in CSV, Excel, or SQL databases.

Using Pandas

PYTHON
# Loading CSV
df = pd.read_csv('dataset.csv')

# Loading Excel
df = pd.read_excel('dataset.xlsx')
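
SQL databases can be read through pandas as well; a sketch assuming a hypothetical SQLite file and table name:

PYTHON
import sqlite3

# Loading from a SQL database (file and table names are placeholders)
conn = sqlite3.connect('dataset.db')
df = pd.read_sql('SELECT * FROM customers', conn)
conn.close()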

Using Scikit-learn (Toy Datasets)

Useful for practice and prototyping.

PYTHON
from sklearn.datasets import load_iris

# Load Iris dataset (Classification)
data = load_iris()
df = pd.DataFrame(data.data, columns=data.feature_names)
df['target'] = data.target


6. Data Exploration (EDA) with Pandas and Visualization

Exploratory Data Analysis (EDA) allows us to understand the data structure, patterns, and anomalies before modeling.

Pandas Inspection Methods

  • df.head(): View the first 5 rows.
  • df.shape: Dimensions of the dataset (rows, columns).
  • df.info(): Summarizes data types and non-null counts per column.
  • df.describe(): Statistical summary (mean, std, min, max, quartiles) for numerical features.
  • df['column'].value_counts(): Frequency of unique values (for categorical data).

Visualization (Matplotlib & Seaborn)

  1. Histograms: Check distribution of numerical data (Normal vs. Skewed).
    PYTHON
        sns.histplot(df['age'], kde=True)
        
  2. Box Plots: Identify outliers and quartiles.
    PYTHON
        sns.boxplot(x=df['salary'])
        
  3. Scatter Plots: Analyze relationships between two numerical variables.
    PYTHON
        plt.scatter(df['experience'], df['salary'])
        
  4. Heatmaps: Visualize correlation matrices to find collinearity.
    PYTHON
        sns.heatmap(df.corr(numeric_only=True), annot=True, cmap='coolwarm')
        

7. Common Data Issues and Cleaning

Real-world data is "dirty." Models trained on bad data produce bad predictions (Garbage In, Garbage Out).

Common Issues

  1. Noise: Random error or variance in a measured variable.
  2. Inconsistency: Formatting differences (e.g., "NY", "N.Y.", "New York"); see the cleanup sketch after this list.
  3. Duplicates: Identical rows that bias the model.
    PYTHON
        df.drop_duplicates(inplace=True)
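
A sketch of fixing the formatting inconsistency from item 2, using a hypothetical 'state' column:

PYTHON
# Map inconsistent spellings to one canonical value
df['state'] = df['state'].replace({'N.Y.': 'NY', 'New York': 'NY'})
# Stripping whitespace and unifying case also removes many inconsistencies
df['state'] = df['state'].str.strip().str.upper()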
        

Handling Missing Values

Missing data is represented as NaN (Not a Number) or None.

Detection: df.isnull().sum()

Strategies:

  1. Dropping:
    • Remove rows (df.dropna()) if the dataset is large and missing rows are few.
    • Remove columns if >50% of the data is missing.
  2. Imputation (Filling):
    • Mean: Use for normally distributed numerical data.
    • Median: Use for skewed numerical data (robust to outliers).
    • Mode: Use for categorical data.
    • Model-based: Use KNN or Regression to predict missing values.

PYTHON
from sklearn.impute import SimpleImputer
imputer = SimpleImputer(strategy='mean')
# fit_transform returns a 2-D array, so assign back through a column list
df[['age']] = imputer.fit_transform(df[['age']])
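
For the other strategies, SimpleImputer also accepts strategy='median' or strategy='most_frequent'; the model-based option can be sketched with KNNImputer (column names are hypothetical):

PYTHON
from sklearn.impute import KNNImputer

# Fill each missing value using the k most similar rows (numerical columns only)
knn_imputer = KNNImputer(n_neighbors=5)
df[['age', 'salary']] = knn_imputer.fit_transform(df[['age', 'salary']])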

Handling Outliers

Outliers are extreme values that deviate significantly from other observations.

  1. Detection:
    • Z-Score: Values more than 3 standard deviations from the mean (|z| > 3) are typically flagged.
    • IQR (Interquartile Range): Values below Q1 - 1.5*IQR or above Q3 + 1.5*IQR.
  2. Treatment:
    • Trimming: Remove the outlier rows.
    • Capping (Winsorizing): Set outliers to the 5th or 95th percentile value.
    • Log Transformation: Compresses the range of high-magnitude values.
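
A sketch of IQR-based detection followed by capping, for a hypothetical 'salary' column:

PYTHON
# IQR fences
q1 = df['salary'].quantile(0.25)
q3 = df['salary'].quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Detection: boolean mask marking outlier rows
outliers = (df['salary'] < lower) | (df['salary'] > upper)

# Treatment by capping: clip extreme values to the fences
df['salary'] = df['salary'].clip(lower=lower, upper=upper)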

8. Handling Class Imbalance

Occurs in classification when one class significantly outnumbers the other (e.g., Fraud Detection: 99% legitimate, 1% fraud). Models become biased toward the majority class.

Techniques

  1. Resampling:
    • Undersampling: Randomly remove samples from the majority class (loss of information).
    • Oversampling: Randomly duplicate samples from the minority class.
  2. SMOTE (Synthetic Minority Over-sampling Technique): Creates synthetic (artificial) data points for the minority class by interpolating between existing samples.
  3. Class Weights: Modify the algorithm to penalize errors on the minority class more heavily (e.g., class_weight='balanced' in Scikit-learn).
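
Class weights are a one-parameter change in Scikit-learn; SMOTE comes from the separate imbalanced-learn package. A sketch, assuming X_train and y_train from a prior train/test split:

PYTHON
from sklearn.linear_model import LogisticRegression

# Penalize mistakes on the minority class more heavily
clf = LogisticRegression(class_weight='balanced')
clf.fit(X_train, y_train)

# SMOTE (requires: pip install imbalanced-learn)
# from imblearn.over_sampling import SMOTE
# X_resampled, y_resampled = SMOTE(random_state=42).fit_resample(X_train, y_train)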

9. Feature Scaling

Scaling ensures all features contribute equally to the result. It is crucial for distance-based algorithms (KNN, SVM, K-Means) and Gradient Descent optimization. Tree-based models (Random Forest) generally do not require scaling.

Techniques

  1. Standardization (Z-score Normalization):

    • Scales features to have Mean (μ) = 0 and Standard Deviation (σ) = 1.
    • Preferred when data follows a Gaussian distribution or has outliers.
    • Formula: z = (x - μ) / σ
      PYTHON
          from sklearn.preprocessing import StandardScaler
          scaler = StandardScaler()
          
  2. Normalization (Min-Max Scaling):

    • Scales features to a range [0, 1].
    • Preferred for neural networks or image processing.
    • Formula: x_scaled = (x - x_min) / (x_max - x_min)
      PYTHON
          from sklearn.preprocessing import MinMaxScaler
          scaler = MinMaxScaler()
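
In either case, the scaler should be fit on the training split only and the learned parameters reused on the test split (a sketch, assuming X_train and X_test already exist):

PYTHON
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # learn mean/std from training data
X_test_scaled = scaler.transform(X_test)        # apply the same parameters to test data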
          

10. Encoding Categorical Variables

Machine learning models require numerical input. Categorical text data must be converted.

1. Label Encoding

Assigns a unique integer to each category (e.g., Low=0, Medium=1, High=2).

  • Use case: Ordinal data (where order matters).
  • Library: sklearn.preprocessing.LabelEncoder

2. One-Hot Encoding

Creates a new binary column for each category.

  • Use case: Nominal data (where no order exists, e.g., Cities).
  • Issue: Can increase dimensionality significantly.
  • Library: pd.get_dummies() or sklearn.preprocessing.OneHotEncoder.
  • Dummy Variable Trap: Multicollinearity caused by including all dummy variables. Solution: Drop one column (drop_first=True).
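
A sketch of both encoders on hypothetical columns:

PYTHON
from sklearn.preprocessing import LabelEncoder

# Label encoding for an ordinal column
# (note: LabelEncoder assigns integers alphabetically; a manual mapping
# such as {'Low': 0, 'Medium': 1, 'High': 2} preserves the intended order)
le = LabelEncoder()
df['size_encoded'] = le.fit_transform(df['size'])

# One-hot encoding for a nominal column, dropping one dummy to avoid the trap
df = pd.get_dummies(df, columns=['city'], drop_first=True)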

11. Feature Engineering

The art of creating new features from existing ones to improve model performance.

Examples:

  • Binning: Converting continuous age into groups (Child, Teen, Adult, Senior).
  • Polynomial Features: Creating interaction and power terms (e.g., x1*x2, x1^2) to capture non-linear relationships.
  • Date/Time Extraction: Extracting 'Day of Week', 'Month', or 'Hour' from a timestamp.
  • Domain Specific: Creating a "Debt-to-Income Ratio" from "Total Debt" and "Income" columns.
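
Sketches of these transformations on hypothetical columns:

PYTHON
# Binning a continuous age into groups
df['age_group'] = pd.cut(df['age'], bins=[0, 12, 19, 64, 120],
                         labels=['Child', 'Teen', 'Adult', 'Senior'])

# Extracting parts of a timestamp
df['timestamp'] = pd.to_datetime(df['timestamp'])
df['day_of_week'] = df['timestamp'].dt.dayofweek
df['month'] = df['timestamp'].dt.month

# Domain-specific ratio feature
df['debt_to_income'] = df['total_debt'] / df['income']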

12. Feature Selection

Selecting the most relevant subset of features to reduce overfitting, improve accuracy, and speed up training.

Methods

  1. Filter Methods: Use statistical tests to select features independent of the model.
    • Correlation Matrix: Remove features highly correlated with each other (redundant).
    • Chi-Square Test: For categorical features.
  2. Wrapper Methods: Evaluate specific subsets of features by training a model.
    • RFE (Recursive Feature Elimination): Recursively removes the least important features.
  3. Embedded Methods: Feature selection occurs during model training.
    • Lasso Regression (L1 Regularization): Shrinks coefficients of less important features to zero.
    • Tree-based Importance: Random Forests provide a feature_importances_ attribute.
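
A sketch of a wrapper and an embedded method (assuming X is a numeric feature DataFrame and y the target; the number of features kept is arbitrary here):

PYTHON
from sklearn.feature_selection import RFE
from sklearn.ensemble import RandomForestClassifier

# Wrapper: recursively drop the least important features
rfe = RFE(estimator=RandomForestClassifier(random_state=42), n_features_to_select=5)
X_selected = rfe.fit_transform(X, y)

# Embedded: inspect tree-based feature importances
rf = RandomForestClassifier(random_state=42).fit(X, y)
print(sorted(zip(rf.feature_importances_, X.columns), reverse=True))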

13. Data Splitting

We must separate data to evaluate how the model performs on unseen data.

  • Training Set: Used to train the model (usually 70-80%).
  • Testing Set: Used to evaluate performance (usually 20-30%).
  • Validation Set: (Optional) Used for hyperparameter tuning during training.

PYTHON
from sklearn.model_selection import train_test_split

X = df.drop('target', axis=1)
y = df['target']

# random_state ensures reproducibility
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)


14. Dimensionality Reduction

Reducing the number of input variables while retaining the meaningful information. Used to mitigate the Curse of Dimensionality.

Principal Component Analysis (PCA)

A linear unsupervised technique that projects data onto new orthogonal axes (Principal Components) that maximize variance.

  • Usage: Visualization (reducing to 2D/3D) or noise reduction.
  • Note: Data must be scaled (Standardized) before applying PCA.

PYTHON
from sklearn.decomposition import PCA

pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_scaled)

Linear Discriminant Analysis (LDA)

A supervised technique that finds a linear combination of features that separates two or more classes. Unlike PCA, LDA focuses on maximizing class separability.
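
A sketch mirroring the PCA example (for LDA, n_components can be at most the number of classes minus one):

PYTHON
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

lda = LinearDiscriminantAnalysis(n_components=2)
X_lda = lda.fit_transform(X_scaled, y)  # supervised: requires the class labels y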