Unit 1 - Notes

INT234

Unit 1: Introduction and Data Preparation

1. Introduction to Predictive Analytics

1.1 Definition and Core Concepts

Predictive Analytics is a branch of advanced analytics that makes predictions about future outcomes using historical data combined with statistical modeling, data mining techniques, and machine learning.

  • Goal: To assess what will happen in the future based on what has happened in the past.
  • Key Output: It provides a probability score for each individual (customer, employee, healthcare patient, product SKU, etc.) to determine, inform, or influence organizational processes.

1.2 The Analytics Spectrum

Predictive analytics sits within a broader spectrum of business intelligence:

  1. Descriptive Analytics: What happened? (Reports, Dashboards).
  2. Diagnostic Analytics: Why did it happen? (Drill-down, Data discovery).
  3. Predictive Analytics: What is likely to happen? (Forecasting, Statistical modeling).
  4. Prescriptive Analytics: What should we do about it? (Optimization, Simulation).

1.3 The Predictive Analytics Process (CRISP-DM)

Most predictive projects follow the Cross-Industry Standard Process for Data Mining (CRISP-DM):

  1. Business Understanding: Defining the problem and objectives.
  2. Data Understanding: Collecting and exploring raw data.
  3. Data Preparation: Cleaning and formatting data (the focus of Unit 1).
  4. Modeling: Selecting and training algorithms.
  5. Evaluation: Testing the model's accuracy.
  6. Deployment: Integrating predictions into business operations.

2. Machine Learning Overview

2.1 Definition

Machine Learning (ML) is a subset of Artificial Intelligence (AI) that focuses on building systems that learn from data. Instead of being explicitly programmed with rules (e.g., "If X > 5 then Y"), ML algorithms identify patterns in data to create their own rules.

2.2 Relationship: AI vs. ML vs. Deep Learning

  • AI: The broad concept of machines acting smartly.
  • Machine Learning: Algorithms that parse data, learn from it, and then make a determination or prediction.
  • Deep Learning: A subset of ML inspired by the structure of the human brain (Artificial Neural Networks), useful for complex data like images and text.

3. Types of Machine Learning

3.1 Supervised Learning

In supervised learning, the algorithm learns from a labeled dataset. This means the data includes both the input features (independent variables) and the correct answer/output (dependent variable).

  • How it works: The model makes predictions on training data and is corrected by the known labels until it achieves a desired level of accuracy.
  • Sub-categories:
    1. Regression: Used when the output variable is continuous (numerical).
      • Examples: Predicting house prices, stock market trends, temperature.
      • Algorithms: Linear Regression, Decision Tree Regressor.
    2. Classification: Used when the output variable is categorical (classes).
      • Examples: Spam detection (Spam/Not Spam), Churn prediction (Yes/No), Disease diagnosis.
      • Algorithms: Logistic Regression, Support Vector Machines (SVM), Random Forest, Naive Bayes (see the sketch after this list).
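
A minimal supervised-classification sketch is shown below. The use of scikit-learn's bundled Iris dataset and the choice of Logistic Regression are assumptions made purely for illustration, not part of the notes above.

PYTHON
# Python Example (illustrative sketch): supervised classification on labeled data
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)   # X = input features, y = known labels (the "answers")
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = LogisticRegression(max_iter=200)   # a classification algorithm
model.fit(X_train, y_train)                # learn from the labeled training data
print(model.score(X_test, y_test))         # accuracy on unseen (test) data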

3.2 Unsupervised Learning

In unsupervised learning, the algorithm is given an unlabeled dataset. There is no "correct answer" provided. The goal is to explore the data and find hidden structures or patterns.

  • How it works: The algorithm tries to group unsorted information according to similarities, patterns, and differences, without any prior training on labeled data.
  • Sub-categories:
    1. Clustering: Grouping inherently similar data points together.
      • Examples: Customer segmentation, grouping search results.
      • Algorithms: K-Means Clustering, Hierarchical Clustering (see the sketch after this list).
    2. Association: Discovering rules that describe large portions of data.
      • Examples: Market Basket Analysis ("People who buy bread also buy butter").
      • Algorithms: Apriori, Eclat.
    3. Dimensionality Reduction: Reducing the number of random variables under consideration.
      • Examples: Compressing image data, feature extraction.
      • Algorithms: Principal Component Analysis (PCA).
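
A minimal clustering sketch follows, running K-Means on a small set of made-up, unlabeled 2-D points (the data values are invented solely for illustration):

PYTHON
# Python Example (illustrative sketch): K-Means clustering on unlabeled data
import numpy as np
from sklearn.cluster import KMeans

# Unlabeled 2-D points (no output/answer column)
X = np.array([[1, 2], [1, 4], [0, 2],
              [9, 10], [10, 12], [11, 11]])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=42)
labels = kmeans.fit_predict(X)   # group points purely by similarity
print(labels)                    # cluster assignment for each point, e.g. [0 0 0 1 1 1]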

3.3 Comparative Summary

Feature     | Supervised Learning                  | Unsupervised Learning
----------- | ------------------------------------ | -------------------------
Input Data  | Labeled (Input + Output)             | Unlabeled (Input only)
Goal        | Predict an outcome                   | Find structure/patterns
Feedback    | Direct feedback (Correct/Incorrect)  | No feedback
Complexity  | Computationally simpler              | Computationally complex
Key Tasks   | Classification, Regression           | Clustering, Association

4. Data Preprocessing

Data Preprocessing is the technique of converting raw, messy data into a clean dataset. In the real world, data is often incomplete, inconsistent, and lacking in certain behaviors or trends.
Motto: "Garbage In, Garbage Out" (GIGO).

4.1 Data Cleaning

Handling missing values, noisy data, and outliers.

A. Missing Values

Data is missing due to corruption or failure to record.

  • Deletion:
    • List-wise deletion: Drop the entire row (risk of losing data).
    • Drop columns: If a column has >50% missing data.
  • Imputation: Filling in the missing data.
    • Mean/Median/Mode: Fill with the average (for numerical) or most frequent (for categorical).
    • Prediction: Use a linear regression model to predict the missing value based on other features.

PYTHON
# Python Example: Imputation using Pandas
import pandas as pd
df = pd.read_csv('data.csv')

# Fill missing numerical values with Mean
df['Age'] = df['Age'].fillna(df['Age'].mean())

# Fill missing categorical values with Mode
df['City'] = df['City'].fillna(df['City'].mode()[0])

B. Noisy Data / Outliers

Random error or variance in a measured variable.

  • Binning: Sorting the data into bins and smoothing each value by consulting its neighbors (e.g., replacing the values in a bin with the bin mean).
  • Clustering: Detecting and removing outliers (points far from clusters).
  • IQR Method: Removing data points that fall below Q1 - 1.5 × IQR or above Q3 + 1.5 × IQR, where IQR (Interquartile Range) = Q3 - Q1, as shown in the sketch below.
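
A short sketch of the IQR rule, assuming the same running DataFrame df with a numerical Salary column:

PYTHON
# Python Example (illustrative sketch): removing outliers with the 1.5 * IQR rule
Q1 = df['Salary'].quantile(0.25)
Q3 = df['Salary'].quantile(0.75)
IQR = Q3 - Q1

lower_fence = Q1 - 1.5 * IQR
upper_fence = Q3 + 1.5 * IQR

# Keep only the rows whose Salary falls inside the fences
df = df[(df['Salary'] >= lower_fence) & (df['Salary'] <= upper_fence)]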

4.2 Data Transformation

Converting data into a format suitable for the algorithm.

A. Normalization (Min-Max Scaling)

Rescaling the features to a fixed range, usually [0, 1], using x' = (x - min) / (max - min).

  • Use case: When algorithms use distance measures (e.g., K-Means, K-Nearest Neighbors) and features have different scales (e.g., Age: 0-100, Salary: 0-100,000); a short sketch follows below.
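
A minimal Min-Max scaling sketch, reusing the Age and Salary columns from the running example (the new column names are illustrative):

PYTHON
# Python Example (illustrative sketch): Min-Max scaling to the [0, 1] range
from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()
df[['Age_Norm', 'Salary_Norm']] = scaler.fit_transform(df[['Age', 'Salary']])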

B. Standardization (Z-Score Scaling)

Rescaling data to have a mean (μ) of 0 and a standard deviation (σ) of 1, i.e., z = (x - μ) / σ.

  • Use case: When data follows a Gaussian (Bell) distribution (e.g., SVM, Logistic Regression).

C. Categorical Encoding

Machine learning models require numerical input. Text labels must be converted.

  1. Label Encoding: Assigning a unique integer to each category (e.g., Low=0, Medium=1, High=2). Caution: the model may interpret the assigned integers as a meaningful order or magnitude, which is misleading for nominal (unordered) categories.
  2. One-Hot Encoding: Creating a new binary column for each category (e.g., Color_Red, Color_Blue).

PYTHON
# Python Example: Encoding
from sklearn.preprocessing import StandardScaler

# Standardization
scaler = StandardScaler()
df['Salary_Scaled'] = scaler.fit_transform(df[['Salary']])

# One-Hot Encoding
df = pd.get_dummies(df, columns=['Gender'])
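
For item 1 above (Label Encoding), a short sketch reusing the City column from the earlier imputation example; the integer codes it produces carry no real order, which is exactly the caution noted above:

PYTHON
# Python Example (illustrative sketch): Label Encoding a categorical column
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
df['City_Code'] = le.fit_transform(df['City'])   # each city becomes a unique integer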

4.3 Data Reduction

Reducing the volume of data but producing the same or similar analytical results.

  1. Feature Selection: Selecting a subset of relevant features.
    • Filter Methods: Correlation matrix (remove highly correlated features).
    • Wrapper Methods: Recursive Feature Elimination.
  2. Dimensionality Reduction: Projecting high-dimensional data into a lower-dimensional space (e.g., PCA); see the sketch below.
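
A minimal PCA sketch on the Iris dataset (chosen only because it ships with scikit-learn), reducing 4 features to 2 principal components:

PYTHON
# Python Example (illustrative sketch): dimensionality reduction with PCA
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X, _ = load_iris(return_X_y=True)             # 4 numerical features per row
X_scaled = StandardScaler().fit_transform(X)  # PCA is sensitive to feature scale

pca = PCA(n_components=2)                     # keep 2 principal components
X_reduced = pca.fit_transform(X_scaled)
print(X_reduced.shape)                        # (150, 2) instead of (150, 4)
print(pca.explained_variance_ratio_)          # variance captured by each component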

4.4 Data Splitting

Before modeling, data must be split to evaluate performance.

  1. Training Set (70-80%): Used to train the model.
  2. Test Set (20-30%): Used to evaluate the model on unseen data.
  3. Validation Set: Optional split used for hyperparameter tuning.

PYTHON
from sklearn.model_selection import train_test_split

X = df.drop('Target', axis=1) # Features
y = df['Target']              # Target Variable

# 80% Train, 20% Test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)