Unit 6 - Notes

INT395

Unit 6: Pipelines, Model Evaluation and Model Deployment

1. Streamlining Workflows with Pipelines

In machine learning, data rarely arrives in a format ready for modeling. It requires preprocessing (imputation, scaling, encoding) before being fed into an algorithm. A Pipeline chains these steps together into a single object.

1.1 The Concept of Pipelines

A pipeline sequentially applies a list of transformers and a final estimator. Intermediate steps of the pipeline must be ‘transformers’ (they must implement fit and transform methods), while the final step only needs to implement fit.

Key Benefits:

  • Prevention of Data Leakage: This is the most critical advantage. When performing Cross-Validation, preprocessing parameters (like Mean for imputation or Min/Max for scaling) must be calculated only on the training fold and applied to the validation fold. Doing this manually is error-prone. Pipelines automate this encapsulation.
  • Convenience and Reproducibility: You can fit and predict using a single object, ensuring the exact same transformations are applied to new data in production as were applied during training.
  • Hyperparameter Tuning: You can perform Grid Search over the parameters of the preprocessing steps and the model simultaneously (e.g., finding the best imputation strategy and the best regularization parameter).

1.2 Components

  1. Transformers: Steps that alter the data (e.g., StandardScaler, OneHotEncoder, SimpleImputer).
  2. Estimator: The final step that learns from the data (e.g., LogisticRegression, RandomForestClassifier).

1.3 Implementation Example (Scikit-Learn)

PYTHON
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression

# Define steps as a list of tuples (name, object)
steps = [
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler()),
    ('classifier', LogisticRegression())
]

pipe = Pipeline(steps)

# The pipeline acts as a single model
# pipe.fit(X_train, y_train)
# y_pred = pipe.predict(X_test)
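
Because every step's parameters are exposed under the <step_name>__<parameter> naming scheme, a single grid search can tune the preprocessing and the model together (the tuning benefit from Section 1.1). A minimal sketch, reusing the pipe object above; X_train and y_train are assumed to exist as before:

PYTHON
from sklearn.model_selection import GridSearchCV

# Step parameters are addressed as "<step_name>__<parameter>"
param_grid = {
    'imputer__strategy': ['mean', 'median'],
    'classifier__C': [0.1, 1.0, 10.0]
}

grid = GridSearchCV(pipe, param_grid, cv=5)

# grid.fit(X_train, y_train)   # each CV fold re-fits the imputer/scaler on its own training split
# print(grid.best_params_)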


2. Cross-Validation Strategies

Cross-Validation (CV) provides a more reliable estimate of model performance than a simple Train/Test split by reducing the variance associated with how the data is split.

2.1 K-Fold Cross-Validation

  • Mechanism: The dataset is split into K equal partitions (folds). The model is trained on K − 1 folds and validated on the remaining fold. This process is repeated K times, ensuring every fold serves as the validation set exactly once.
  • Result: The final performance metric is the average of the K scores.
  • Standard K: Usually 5 or 10.
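
A minimal sketch of K-Fold CV with scikit-learn's cross_val_score; the synthetic dataset and classifier are stand-ins:

PYTHON
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

X, y = make_classification(n_samples=200, random_state=0)  # stand-in data

cv = KFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=cv)

print(scores)         # one score per fold
print(scores.mean())  # final performance estimate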

2.2 Stratified K-Fold

  • Use Case: Essential for imbalanced datasets (classification problems).
  • Mechanism: Ensures that the proportion of samples for each class is roughly the same in each fold as it is in the whole dataset.
  • Why: Prevents a scenario where a validation fold contains none (or all) of the minority class instances, which would lead to misleading evaluation scores.
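
A short sketch showing that StratifiedKFold keeps the class ratio in every fold; the 90/10 imbalance below is an arbitrary illustration:

PYTHON
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold

# Imbalanced stand-in data: roughly 90% class 0, 10% class 1
X, y = make_classification(n_samples=200, weights=[0.9, 0.1], random_state=0)

skf = StratifiedKFold(n_splits=5)
for fold, (train_idx, val_idx) in enumerate(skf.split(X, y)):
    # Every validation fold keeps roughly the same 90/10 class ratio
    print(fold, np.bincount(y[val_idx]))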

2.3 Leave-One-Out Cross-Validation (LOOCV)

  • Mechanism: A special case of K-Fold where K equals the number of observations (K = n). In each iteration, the model is trained on n − 1 samples and tested on the single remaining sample.
  • Pros: Unbiased estimate of performance; no randomness in the split.
  • Cons: Extremely computationally expensive for large datasets; high variance in the estimation of the prediction error.
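
LOOCV uses the same API via LeaveOneOut; note that it refits the model once per sample, so the dataset in this sketch is deliberately tiny:

PYTHON
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import LeaveOneOut, cross_val_score

X, y = make_classification(n_samples=50, random_state=0)  # deliberately small

scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=LeaveOneOut())
print(len(scores))    # 50 fits: one per observation
print(scores.mean())  # each score is 0 or 1; the mean is the LOOCV accuracy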

2.4 Time Series Split

  • Use Case: Temporal data (stock prices, weather forecasting).
  • Constraint: Standard K-Fold cannot be used because it shuffles data, causing "look-ahead bias" (training on future data to predict the past).
  • Mechanism: The training set expands in each iteration.
    • Split 1: Train on fold 1, Validate on fold 2.
    • Split 2: Train on folds 1–2, Validate on fold 3.
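
scikit-learn implements this expanding-window scheme as TimeSeriesSplit; printing the indices makes the "train on the past, validate on the future" pattern visible (the 12-point series is a stand-in):

PYTHON
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

X = np.arange(12).reshape(-1, 1)   # 12 time-ordered observations (stand-in)

tscv = TimeSeriesSplit(n_splits=3)
for train_idx, val_idx in tscv.split(X):
    # Training indices always precede validation indices; nothing is shuffled
    print("train:", train_idx, "validate:", val_idx)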

3. Debugging Algorithms: Learning and Validation Curves

Diagnosing whether a model is suffering from Underfitting (High Bias) or Overfitting (High Variance) is crucial for improving performance.

3.1 Learning Curves

A Learning Curve plots the model's performance (score) on the y-axis against the training set size on the x-axis. It displays two lines: Training Score and Validation Score.

  • Scenario A: High Bias (Underfitting)

    • Visual: Both Training and Validation scores are low. The curves converge quickly and flatten out.
    • Interpretation: The model is too simple to capture the underlying pattern.
    • Fix: Add features, increase model complexity (e.g., polynomial features), or decrease regularization. Adding more data will not help.
  • Scenario B: High Variance (Overfitting)

    • Visual: Training score is very high (near perfect), but Validation score is low. There is a large "gap" between the two curves.
    • Interpretation: The model memorized the training noise rather than generalizing.
    • Fix: Add more training data (likely to help close the gap), increase regularization, perform feature selection/reduction.
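
Both scenarios can be diagnosed with scikit-learn's learning_curve helper; a minimal sketch with stand-in data and a placeholder classifier:

PYTHON
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import learning_curve

X, y = make_classification(n_samples=500, random_state=0)  # stand-in data

# Evaluate the model at 10%, 32.5%, 55%, 77.5% and 100% of the training data
train_sizes, train_scores, val_scores = learning_curve(
    LogisticRegression(max_iter=1000), X, y,
    train_sizes=np.linspace(0.1, 1.0, 5), cv=5
)

# A large, persistent gap between the two mean curves signals high variance;
# two low, converged curves signal high bias
print(train_sizes)
print(train_scores.mean(axis=1))
print(val_scores.mean(axis=1))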

3.2 Validation Curves

A Validation Curve plots the model's performance on the y-axis against the values of a specific hyperparameter on the x-axis (e.g., tree depth, alpha in Ridge regression).

  • Purpose: To find the "Sweet Spot" (optimal complexity).
  • Interpretation:
    • Left side (Low complexity): Low Training score, Low Validation score (Underfitting).
    • Right side (High complexity): High Training score, Dropping Validation score (Overfitting).
    • Peak: The hyperparameter value where the Validation score is maximized before it starts to degrade due to overfitting.
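
The corresponding helper is validation_curve; the sketch below sweeps the depth of a decision tree as an illustrative hyperparameter:

PYTHON
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import validation_curve
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, random_state=0)  # stand-in data

depths = np.arange(1, 11)  # complexity axis: tree depth 1..10
train_scores, val_scores = validation_curve(
    DecisionTreeClassifier(random_state=0), X, y,
    param_name='max_depth', param_range=depths, cv=5
)

# The "sweet spot" is the depth with the highest mean validation score
print(depths[val_scores.mean(axis=1).argmax()])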

4. Importance of Model Deployment

Deployment is the transition of a machine learning model from a development environment (e.g., Jupyter Notebook) to a production environment where it can interact with real-world applications.

4.1 Why Deployment Matters

  • Value Realization: A model sitting in a notebook generates no business value (ROI). Deployment integrates the intelligence into products or decision-making processes.
  • Feedback Loop: Real-world usage generates new data, which is essential for monitoring Data Drift (changes in input distribution) and Concept Drift (changes in the relationship between inputs and targets).
  • Scalability: Production environments allow the model to handle requests from thousands of users simultaneously.

4.2 The Challenge

The "It works on my machine" problem. Deployment requires managing dependencies (libraries, versions), hardware differences, and latency requirements.


5. Model Serialization and Deserialization

To deploy a model, we must save the model object (trained weights and architecture) to a file system and reload it later.

5.1 Serialization (Pickling)

The process of converting a Python object hierarchy (the trained model) into a byte stream.

  • Tools:
    • pickle: Standard Python module.
    • joblib: Often preferred for Scikit-Learn estimators. It is more efficient for objects containing large NumPy arrays.

5.2 Deserialization (Unpickling)

The inverse operation: reconstructing the Python object from the byte stream.

5.3 Code Example (Joblib)

PYTHON
import joblib
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Placeholder training data (any feature matrix X and labels y would do)
X, y = make_classification(n_samples=100, random_state=42)

# 1. Train model
model = RandomForestClassifier()
model.fit(X, y)

# 2. Serialize (Save to disk)
joblib.dump(model, 'random_forest_v1.pkl')

# --- Later, in a different script or server ---

# 3. Deserialize (Load from disk)
loaded_model = joblib.load('random_forest_v1.pkl')

# 4. Use model on new data with the same feature layout
new_data = X[:5]   # stand-in for unseen feature rows
prediction = loaded_model.predict(new_data)

Security Note: Never unpickle data received from an untrusted or unauthenticated source. Malicious code can be executed during unpickling.


6. Deployment Strategies

6.1 Local Deployment

Running the model on the same machine where the application resides.

  • Implementation: The model is loaded directly into the application script.
  • Use Case: IoT devices, mobile apps (edge computing), or offline batch processing scripts.
  • Pros: Zero network latency, privacy (data doesn't leave the device).
  • Cons: Limited by local hardware resources; updating the model requires updating the application software.

6.2 Web Service Deployment (API)

The most common pattern. The model is wrapped in a REST API (using frameworks like Flask, FastAPI, or Django).

  • Workflow:
    1. Client sends data via HTTP POST request (JSON).
    2. Server receives JSON, converts it to a DataFrame/Array.
    3. Model predicts.
    4. Server returns prediction as JSON.
  • Containerization (Docker): To ensure consistency, the API and model are usually packaged in a Docker container. This creates a portable image containing the OS, libraries, and code.
  • Pros: Centralized control, easy to update the model without changing client apps, scalable.
  • Cons: Network latency, requires server infrastructure management.
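
A minimal sketch of this workflow using Flask (one of the frameworks named above); the endpoint path, payload format, and model filename are illustrative assumptions:

PYTHON
import joblib
import numpy as np
from flask import Flask, jsonify, request

app = Flask(__name__)
model = joblib.load('random_forest_v1.pkl')  # model serialized in Section 5.3

@app.route('/predict', methods=['POST'])
def predict():
    payload = request.get_json()               # 1. client sends JSON, e.g. {"features": [[...], [...]]}
    features = np.array(payload['features'])   # 2. convert JSON to an array
    prediction = model.predict(features)       # 3. model predicts
    return jsonify({'prediction': prediction.tolist()})  # 4. return prediction as JSON

if __name__ == '__main__':
    app.run(port=5000)  # development server only; use a WSGI server (e.g., gunicorn) in production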

6.3 Serverless Deployment

Using cloud-native "Functions as a Service" (FaaS) like AWS Lambda, Google Cloud Functions, or Azure Functions.

  • Concept: You upload the code (prediction function) and the model file. The cloud provider dynamically manages the allocation of machine resources. The function "spins up" when triggered by a request and shuts down when done.
  • Pros:
    • Cost-efficient: You pay only for the compute time used (milliseconds), not for idle servers.
    • No Ops: No server management or patching.
    • Auto-scaling: Automatically handles spikes in traffic.
  • Cons:
    • Cold Starts: If the function hasn't been used recently, there is a delay (latency) while the container starts up and loads the model into memory.
    • Size Limits: Serverless functions often have strict limits on deployment package size (e.g., difficult to deploy massive Deep Learning models without workarounds like loading from S3).
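
As an illustration of the FaaS pattern, below is a sketch of an AWS Lambda handler. It assumes the model file is bundled in the deployment package and that requests arrive through an API Gateway proxy with a JSON body containing a 'features' list; both are assumptions, not requirements of Lambda itself:

PYTHON
import json
import joblib

# Loading at module level means the model is deserialized once per container,
# so only "cold start" invocations pay the loading cost.
model = joblib.load('random_forest_v1.pkl')

def lambda_handler(event, context):
    body = json.loads(event['body'])              # API Gateway proxy places the JSON payload in event['body']
    prediction = model.predict(body['features'])
    return {
        'statusCode': 200,
        'body': json.dumps({'prediction': prediction.tolist()})
    }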