Unit 6 - Notes

INT395

Unit 6: Pipelines, Model Evaluation and Model Deployment

1. Streamlining Workflows with Pipelines

In machine learning, data rarely arrives in a format ready for modeling. It requires preprocessing (imputation, scaling, encoding) before being fed into an algorithm. A Pipeline chains these steps together into a single object.

1.1 The Concept of Pipelines

A pipeline sequentially applies a list of transformers and a final estimator. Intermediate steps of the pipeline must be ‘transformers’ (they must implement fit and transform methods), while the final step only needs to implement fit.

Key Benefits:

  • Prevention of Data Leakage: This is the most critical advantage. When performing Cross-Validation, preprocessing parameters (like Mean for imputation or Min/Max for scaling) must be calculated only on the training fold and applied to the validation fold. Doing this manually is error-prone. Pipelines automate this encapsulation.
  • Convenience and Reproducibility: You can fit and predict using a single object, ensuring the exact same transformations are applied to new data in production as were applied during training.
  • Hyperparameter Tuning: You can perform Grid Search over the parameters of the preprocessing steps and the model simultaneously (e.g., finding the best imputation strategy and the best regularization parameter).

1.2 Components

  1. Transformers: Steps that alter the data (e.g., StandardScaler, OneHotEncoder, SimpleImputer).
  2. Estimator: The final step that learns from the data (e.g., LogisticRegression, RandomForestClassifier).

1.3 Implementation Example (Scikit-Learn)

PYTHON
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression

# Define steps as a list of tuples (name, object)
steps = [
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler()),
    ('classifier', LogisticRegression())
]

pipe = Pipeline(steps)

# The pipeline acts as a single model
# pipe.fit(X_train, y_train)
# y_pred = pipe.predict(X_test)
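
Because every step's parameters are exposed under the <step_name>__<parameter> naming scheme, a single grid search can tune the preprocessing and the model together (the tuning benefit from Section 1.1). A minimal sketch, reusing the pipe object above; X_train and y_train are assumed to exist as before:

PYTHON
from sklearn.model_selection import GridSearchCV

# Step parameters are addressed as "<step_name>__<parameter>"
param_grid = {
    'imputer__strategy': ['mean', 'median'],
    'classifier__C': [0.1, 1.0, 10.0]
}

grid = GridSearchCV(pipe, param_grid, cv=5)

# grid.fit(X_train, y_train)   # each CV fold re-fits the imputer/scaler on its own training split
# print(grid.best_params_)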


2. Cross-Validation Strategies

Cross-Validation (CV) provides a more reliable estimate of model performance than a simple Train/Test split by reducing the variance associated with how the data is split.

2.1 K-Fold Cross-Validation

  • Mechanism: The dataset is split into K equal partitions (folds). The model is trained on K − 1 folds and validated on the remaining fold. This process is repeated K times, ensuring every fold serves as the validation set exactly once.
  • Result: The final performance metric is the average of the K scores.
  • Standard K: Usually 5 or 10.
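
A minimal sketch of K-Fold CV with scikit-learn's cross_val_score; the synthetic dataset and classifier are stand-ins:

PYTHON
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

X, y = make_classification(n_samples=200, random_state=0)  # stand-in data

cv = KFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=cv)

print(scores)         # one score per fold
print(scores.mean())  # final performance estimate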

2.2 Stratified K-Fold

  • Use Case: Essential for imbalanced datasets (classification problems).
  • Mechanism: Ensures that the proportion of samples for each class is roughly the same in each fold as it is in the whole dataset.
  • Why: Prevents a scenario where a validation fold contains none (or all) of the minority class instances, which would lead to misleading evaluation scores.
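
A short sketch showing that StratifiedKFold keeps the class ratio in every fold; the 90/10 imbalance below is an arbitrary illustration:

PYTHON
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold

# Imbalanced stand-in data: roughly 90% class 0, 10% class 1
X, y = make_classification(n_samples=200, weights=[0.9, 0.1], random_state=0)

skf = StratifiedKFold(n_splits=5)
for fold, (train_idx, val_idx) in enumerate(skf.split(X, y)):
    # Every validation fold keeps roughly the same 90/10 class ratio
    print(fold, np.bincount(y[val_idx]))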

2.3 Leave-One-Out Cross-Validation (LOOCV)

  • Mechanism: A special case of K-Fold where K equals the number of observations (K = n). In each iteration, the model is trained on n − 1 samples and tested on the single remaining sample.
  • Pros: Unbiased estimate of performance; no randomness in the split.
  • Cons: Extremely computationally expensive for large datasets; high variance in the estimation of the prediction error.
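
LOOCV uses the same API via LeaveOneOut; note that it refits the model once per sample, so the dataset in this sketch is deliberately tiny:

PYTHON
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import LeaveOneOut, cross_val_score

X, y = make_classification(n_samples=50, random_state=0)  # deliberately small

scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=LeaveOneOut())
print(len(scores))    # 50 fits: one per observation
print(scores.mean())  # each score is 0 or 1; the mean is the LOOCV accuracy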

2.4 Time Series Split

  • Use Case: Temporal data (stock prices, weather forecasting).
  • Constraint: Standard K-Fold cannot be used because it shuffles data, causing "look-ahead bias" (training on future data to predict the past).
  • Mechanism: The training set expands in each iteration.
    • Split 1: Train on fold 1, Validate on fold 2.
    • Split 2: Train on folds 1–2, Validate on fold 3.
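
scikit-learn implements this expanding-window scheme as TimeSeriesSplit; printing the indices makes the "train on the past, validate on the future" pattern visible (the 12-point series is a stand-in):

PYTHON
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

X = np.arange(12).reshape(-1, 1)   # 12 time-ordered observations (stand-in)

tscv = TimeSeriesSplit(n_splits=3)
for train_idx, val_idx in tscv.split(X):
    # Training indices always precede validation indices; nothing is shuffled
    print("train:", train_idx, "validate:", val_idx)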

3. Debugging Algorithms: Learning and Validation Curves

Diagnosing whether a model is suffering from Underfitting (High Bias) or Overfitting (High Variance) is crucial for improving performance.

3.1 Learning Curves

A Learning Curve plots the model's performance (score) on the y-axis against the training set size on the x-axis. It displays two lines: Training Score and Validation Score.

  • Scenario A: High Bias (Underfitting)

    • Visual: Both Training and Validation scores are low. The curves converge quickly and flatten out.
    • Interpretation: The model is too simple to capture the underlying pattern.
    • Fix: Add features, increase model complexity (e.g., polynomial features), or decrease regularization. Adding more data will not help.
  • Scenario B: High Variance (Overfitting)

    • Visual: Training score is very high (near perfect), but Validation score is low. There is a large "gap" between the two curves.
    • Interpretation: The model memorized the training noise rather than generalizing.
    • Fix: Add more training data (likely to help close the gap), increase regularization, perform feature selection/reduction.
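
Both scenarios can be diagnosed with scikit-learn's learning_curve helper; a minimal sketch with stand-in data and a placeholder classifier:

PYTHON
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import learning_curve

X, y = make_classification(n_samples=500, random_state=0)  # stand-in data

# Evaluate the model at 10%, 32.5%, 55%, 77.5% and 100% of the training data
train_sizes, train_scores, val_scores = learning_curve(
    LogisticRegression(max_iter=1000), X, y,
    train_sizes=np.linspace(0.1, 1.0, 5), cv=5
)

# A large, persistent gap between the two mean curves signals high variance;
# two low, converged curves signal high bias
print(train_sizes)
print(train_scores.mean(axis=1))
print(val_scores.mean(axis=1))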

3.2 Validation Curves

A Validation Curve plots the model's performance on the y-axis against the values of a specific hyperparameter on the x-axis (e.g., tree depth, alpha in Ridge regression).

  • Purpose: To find the "Sweet Spot" (optimal complexity).
  • Interpretation:
    • Left side (Low complexity): Low Training score, Low Validation score (Underfitting).
    • Right side (High complexity): High Training score, Dropping Validation score (Overfitting).
    • Peak: The hyperparameter value where the Validation score is maximized before it starts to degrade due to overfitting.
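
The corresponding helper is validation_curve; the sketch below sweeps the depth of a decision tree as an illustrative hyperparameter:

PYTHON
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import validation_curve
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, random_state=0)  # stand-in data

depths = np.arange(1, 11)  # complexity axis: tree depth 1..10
train_scores, val_scores = validation_curve(
    DecisionTreeClassifier(random_state=0), X, y,
    param_name='max_depth', param_range=depths, cv=5
)

# The "sweet spot" is the depth with the highest mean validation score
print(depths[val_scores.mean(axis=1).argmax()])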

4. Importance of Model Deployment

Deployment is the transition of a machine learning model from a development environment (e.g., Jupyter Notebook) to a production environment where it can interact with real-world applications.

4.1 Why Deployment Matters

  • Value Realization: A model sitting in a notebook generates no business value (ROI). Deployment integrates the intelligence into products or decision-making processes.
  • Feedback Loop: Real-world usage generates new data, which is essential for monitoring Data Drift (changes in input distribution) and Concept Drift (changes in the relationship between inputs and targets).
  • Scalability: Production environments allow the model to handle requests from thousands of users simultaneously.

4.2 The Challenge

The "It works on my machine" problem. Deployment requires managing dependencies (libraries, versions), hardware differences, and latency requirements.


5. Model Serialization and Deserialization

To deploy a model, we must save the model object (trained weights and architecture) to a file system and reload it later.

5.1 Serialization (Pickling)

The process of converting a Python object hierarchy (the trained model) into a byte stream.

  • Tools:
    • pickle: Standard Python module.
    • joblib: Often preferred for Scikit-Learn estimators. It is more efficient for objects containing large NumPy arrays.

5.2 Deserialization (Unpickling)

The inverse operation: reconstructing the Python object from the byte stream.

5.3 Code Example (Joblib)

PYTHON
import joblib
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Placeholder training data (any feature matrix X and labels y would do)
X, y = make_classification(n_samples=100, random_state=42)

# 1. Train model
model = RandomForestClassifier()
model.fit(X, y)

# 2. Serialize (Save to disk)
joblib.dump(model, 'random_forest_v1.pkl')

# --- Later, in a different script or server ---

# 3. Deserialize (Load from disk)
loaded_model = joblib.load('random_forest_v1.pkl')

# 4. Use model on new data with the same feature layout
new_data = X[:5]   # stand-in for unseen feature rows
prediction = loaded_model.predict(new_data)

Security Note: Never unpickle data received from an untrusted or unauthenticated source. Malicious code can be executed during unpickling.


6. Deployment Strategies

6.1 Local Deployment

Running the model on the same machine where the application resides.

  • Implementation: The model is loaded directly into the application script.
  • Use Case: IoT devices, mobile apps (edge computing), or offline batch processing scripts.
  • Pros: Zero network latency, privacy (data doesn't leave the device).
  • Cons: Limited by local hardware resources; updating the model requires updating the application software.

6.2 Web Service Deployment (API)

The most common pattern. The model is wrapped in a REST API (using frameworks like Flask, FastAPI, or Django).

  • Workflow:
    1. Client sends data via HTTP POST request (JSON).
    2. Server receives JSON, converts it to a DataFrame/Array.
    3. Model predicts.
    4. Server returns prediction as JSON.
  • Containerization (Docker): To ensure consistency, the API and model are usually packaged in a Docker container. This creates a portable image containing the OS, libraries, and code.
  • Pros: Centralized control, easy to update the model without changing client apps, scalable.
  • Cons: Network latency, requires server infrastructure management.
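
A minimal sketch of this workflow using Flask (one of the frameworks named above); the endpoint path, payload format, and model filename are illustrative assumptions:

PYTHON
import joblib
import numpy as np
from flask import Flask, jsonify, request

app = Flask(__name__)
model = joblib.load('random_forest_v1.pkl')  # model serialized in Section 5.3

@app.route('/predict', methods=['POST'])
def predict():
    payload = request.get_json()               # 1. client sends JSON, e.g. {"features": [[...], [...]]}
    features = np.array(payload['features'])   # 2. convert JSON to an array
    prediction = model.predict(features)       # 3. model predicts
    return jsonify({'prediction': prediction.tolist()})  # 4. return prediction as JSON

if __name__ == '__main__':
    app.run(port=5000)  # development server only; use a WSGI server (e.g., gunicorn) in production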

6.3 Serverless Deployment

Using cloud-native "Functions as a Service" (FaaS) like AWS Lambda, Google Cloud Functions, or Azure Functions.

  • Concept: You upload the code (prediction function) and the model file. The cloud provider dynamically manages the allocation of machine resources. The function "spins up" when triggered by a request and shuts down when done.
  • Pros:
    • Cost-efficient: You pay only for the compute time used (milliseconds), not for idle servers.
    • No Ops: No server management or patching.
    • Auto-scaling: Automatically handles spikes in traffic.
  • Cons:
    • Cold Starts: If the function hasn't been used recently, there is a delay (latency) while the container starts up and loads the model into memory.
    • Size Limits: Serverless functions often have strict limits on deployment package size (e.g., difficult to deploy massive Deep Learning models without workarounds like loading from S3).
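
As an illustration of the FaaS pattern, below is a sketch of an AWS Lambda handler. It assumes the model file is bundled in the deployment package and that requests arrive through an API Gateway proxy with a JSON body containing a 'features' list; both are assumptions, not requirements of Lambda itself:

PYTHON
import json
import joblib

# Loading at module level means the model is deserialized once per container,
# so only "cold start" invocations pay the loading cost.
model = joblib.load('random_forest_v1.pkl')

def lambda_handler(event, context):
    body = json.loads(event['body'])              # API Gateway proxy places the JSON payload in event['body']
    prediction = model.predict(body['features'])
    return {
        'statusCode': 200,
        'body': json.dumps({'prediction': prediction.tolist()})
    }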