Unit 5 - Notes

CSE275

Unit 5: Optimization for Machine Learning and AutoML

1. Hyperparameter Optimization (HPO)

Hyperparameter Optimization, also known as hyperparameter tuning, is the process of finding a set of optimal hyperparameters for a learning algorithm. The performance of a machine learning model is highly dependent on the choice of these hyperparameters.

1.1. Model Parameters vs. Hyperparameters

It is crucial to distinguish between model parameters and hyperparameters.

  • Definition:
    • Model parameters: variables that the learning algorithm learns from the training data.
    • Hyperparameters: configuration variables that are external to the model and set before training.
  • Source:
    • Model parameters: learned/estimated from data.
    • Hyperparameters: set by the practitioner.
  • Purpose:
    • Model parameters: used by the model to make predictions.
    • Hyperparameters: control the learning process itself (e.g., complexity, speed).
  • Examples:
    • Model parameters: weights and biases in a neural network; coefficients in a linear regression model; split points in a decision tree.
    • Hyperparameters: learning rate in gradient descent; C and gamma in an SVM; number of trees in a Random Forest; number of hidden layers in a neural network.
  • How they are set:
    • Model parameters: through an optimization algorithm like Gradient Descent during training.
    • Hyperparameters: through tuning techniques like Grid Search, Random Search, or Bayesian Optimization.
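The distinction is easy to see in scikit-learn, where hyperparameters are passed to a model's constructor while learned parameters appear as fitted attributes. A minimal sketch (using Ridge regression purely as an illustrative model):

```python
# Hyperparameters vs. model parameters in scikit-learn
import numpy as np
from sklearn.linear_model import Ridge

X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([1.0, 3.0, 5.0, 7.0])

model = Ridge(alpha=0.5)   # alpha is a hyperparameter: set by us, before training
model.fit(X, y)

print(model.get_params()["alpha"])      # hyperparameter: unchanged by training
print(model.coef_, model.intercept_)    # model parameters: learned from the data
```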

1.2. The HPO Problem

Formally, HPO can be seen as an optimization problem. Let A be a learning algorithm with a hyperparameter configuration space Λ. Let a specific set of hyperparameters be λ ∈ Λ. The goal of HPO is to find the optimal set of hyperparameters λ* that minimizes a loss function L on a validation dataset D_valid after the model A_λ has been trained on a training dataset D_train.

Objective:
λ* = argmin_{λ ∈ Λ} L(A_λ(D_train), D_valid)

This means we want to find the hyperparameters that result in a model with the best performance on unseen data.
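The objective above can be sketched as a loop over candidate configurations: train A_λ on D_train, measure its loss on D_valid, and keep the argmin. A minimal sketch, assuming a toy configuration space over a decision tree's max_depth:

```python
# Minimal sketch of the HPO objective: keep the lambda with the lowest validation loss
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, random_state=0)
X_train, X_valid, y_train, y_valid = train_test_split(X, y, random_state=0)

candidates = [1, 2, 4, 8, 16]   # a toy configuration space Lambda for max_depth
best_lam, best_loss = None, float("inf")
for lam in candidates:
    model = DecisionTreeClassifier(max_depth=lam, random_state=0)
    model.fit(X_train, y_train)                    # A_lambda(D_train)
    loss = 1.0 - model.score(X_valid, y_valid)     # L(., D_valid) as error rate
    if loss < best_loss:
        best_lam, best_loss = lam, loss

print(best_lam, best_loss)
```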


2. Foundational HPO Techniques: Grid Search and Random Search

These are the most common and straightforward methods for HPO.

2.1. Grid Search

Grid Search performs an exhaustive search through a manually specified subset of the hyperparameter space.

  • How it works:

    1. Define a discrete grid of hyperparameter values.
    2. For every combination of values in the grid, train a model.
    3. Evaluate each model on a validation set.
    4. Select the hyperparameter combination that resulted in the best performance.
  • Example: For an SVM with hyperparameters C and gamma:

    • C values: [0.1, 1, 10]
    • gamma values: [0.001, 0.01]
    • Grid search will train and evaluate 3 * 2 = 6 different models.
  • Pros:

    • Simplicity: Easy to understand and implement.
    • Exhaustiveness: Guaranteed to find the best combination within the specified grid.
  • Cons:

    • Curse of Dimensionality: The number of combinations grows exponentially with the number of hyperparameters. If you have 5 hyperparameters with 5 values each, you need to train 5^5 = 3125 models.
    • Inefficient: Spends most of its time evaluating unpromising regions of the search space.
    • Discrete Nature: May miss the optimal value if it lies between two grid points.

PYTHON
# Conceptual Python code for Grid Search
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC

# Synthetic data so the example is self-contained
X, y = make_classification(n_samples=200, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Define the parameter grid
param_grid = {
    'C': [0.1, 1, 10, 100],
    'gamma': [1, 0.1, 0.01, 0.001],
    'kernel': ['rbf']
}

# Instantiate the grid search model; refit=True retrains on the full
# training set with the best parameters found
grid = GridSearchCV(SVC(), param_grid, refit=True, verbose=2)

# Fit the model (evaluates 4 * 4 * 1 = 16 candidates, each with 5-fold CV)
grid.fit(X_train, y_train)

# Print best parameters
print(grid.best_params_)

2.2. Random Search

Random Search samples a fixed number of hyperparameter combinations from specified statistical distributions.

  • How it works:

    1. Define a search space for each hyperparameter, often as a statistical distribution (e.g., uniform for continuous values).
    2. Specify the number of iterations (n_iter) to perform.
    3. In each iteration, randomly sample a combination of hyperparameters from their respective distributions.
    4. Train and evaluate a model with the sampled combination.
    5. Select the combination that yielded the best performance after n_iter trials.
  • Why it's often better than Grid Search: Research by Bergstra and Bengio (2012) showed that for most problems, only a few hyperparameters significantly impact model performance.

    • Grid Search wastes evaluations on unimportant parameters.
    • Random Search explores the space more effectively, allowing for a more fine-grained search of the important parameters for the same computational budget.
  • Pros:

    • Efficiency: More likely to find a good hyperparameter set in fewer iterations than Grid Search.
    • Flexibility: Works with continuous and discrete hyperparameters.
    • Scalability: Better suited for high-dimensional search spaces.
  • Cons:

    • Stochastic: Not guaranteed to find the optimal value. The quality of the result depends on the number of iterations and random chance.
    • Uninformed: Like Grid Search, it doesn't learn from past evaluations to inform future choices.

PYTHON
# Conceptual Python code for Random Search
from scipy.stats import uniform
from sklearn.datasets import make_classification
from sklearn.model_selection import RandomizedSearchCV, train_test_split
from sklearn.svm import SVC

# Synthetic data so the example is self-contained
X, y = make_classification(n_samples=200, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Define the parameter distributions
param_dist = {
    'C': uniform(loc=0, scale=100),  # continuous uniform distribution on [0, 100]
    'gamma': ['scale', 'auto'],
    'kernel': ['rbf', 'poly', 'sigmoid']
}

# Instantiate the random search model: 100 sampled configurations, 5-fold CV
random_search = RandomizedSearchCV(SVC(), param_distributions=param_dist,
                                   n_iter=100, cv=5, random_state=0)

# Fit the model
random_search.fit(X_train, y_train)

# Print best parameters
print(random_search.best_params_)

2.3. Limitations of Grid and Random Search

  • Uninformed Search: Each trial is independent. The search process does not learn from previous evaluations to guide the search towards more promising regions of the hyperparameter space.
  • Static Search Space: The search space is defined once at the beginning and is not adapted during the search.
  • Inefficient: A significant portion of the computational budget is often wasted on evaluating poor hyperparameter combinations.

3. Advanced HPO Techniques

To overcome the limitations of basic search methods, more intelligent, model-based optimization techniques are used.

3.1. Evolutionary Hyperparameter Tuning

These techniques are inspired by the process of natural selection. They maintain a "population" of hyperparameter sets and iteratively evolve them to find better solutions.

  • Core Concepts:

    • Population: A collection of candidate solutions (i.e., sets of hyperparameters).
    • Individual: A single candidate solution within the population.
    • Fitness: A score assigned to an individual, typically the model's performance on a validation set (e.g., accuracy or negative loss).
    • Generation: One iteration of the evolutionary process.
  • Algorithm Steps:

    1. Initialization: Create an initial population of N individuals by randomly sampling hyperparameter sets.
    2. Evaluation (Fitness Calculation): For each individual in the population, train a model with its hyperparameters and calculate its fitness score.
    3. Selection: Select the "fittest" individuals to be "parents" for the next generation. Individuals with better fitness scores have a higher probability of being selected.
    4. Crossover (Recombination): Create new "offspring" by combining the hyperparameters of two parent individuals. For example, a new learning_rate could be the average of the parents' learning rates.
    5. Mutation: Introduce small, random changes into the offspring's hyperparameters. This helps maintain diversity in the population and avoid getting stuck in local optima.
    6. Replacement: The new generation of offspring replaces the less-fit individuals from the previous generation.
    7. Termination: Repeat steps 2-6 for a fixed number of generations or until the fitness score converges.
  • Pros:

    • Effective for complex, non-convex, and high-dimensional search spaces.
    • Can escape local optima due to mutation and crossover mechanisms.
    • Naturally parallelizable.
  • Cons:

    • Can be computationally expensive and slow to converge.
    • Introduces its own set of "meta-hyperparameters" (e.g., population size, mutation rate, crossover rate) that need to be set.
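The algorithm steps above can be sketched in a few dozen lines. This is an illustrative toy implementation, not a standard library API: a small population of (max_depth, min_samples_split) pairs evolves by selection, crossover, and mutation, with validation accuracy as fitness.

```python
# Toy evolutionary hyperparameter tuning for a decision tree
import random
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

rng = random.Random(0)
X, y = make_classification(n_samples=300, random_state=0)
X_tr, X_va, y_tr, y_va = train_test_split(X, y, random_state=0)

def fitness(ind):
    depth, min_split = ind
    clf = DecisionTreeClassifier(max_depth=depth, min_samples_split=min_split,
                                 random_state=0).fit(X_tr, y_tr)
    return clf.score(X_va, y_va)   # validation accuracy as the fitness score

def mutate(ind):
    depth, min_split = ind
    if rng.random() < 0.3:                       # small random perturbations
        depth = max(1, depth + rng.choice([-1, 1]))
    if rng.random() < 0.3:
        min_split = max(2, min_split + rng.choice([-1, 1]))
    return (depth, min_split)

# 1. Initialization: random population of hyperparameter sets
population = [(rng.randint(1, 10), rng.randint(2, 10)) for _ in range(6)]
for generation in range(3):
    # 2. Evaluation + 3. Selection: keep the fitter half as parents
    parents = sorted(population, key=fitness, reverse=True)[:3]
    # 4. Crossover: child takes one gene from each parent; 5. Mutation
    children = []
    while len(children) < 3:
        a, b = rng.sample(parents, 2)
        children.append(mutate((a[0], b[1])))
    # 6. Replacement: offspring replace the less-fit individuals
    population = parents + children

best = max(population, key=fitness)
print(best)
```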

3.2. Bayesian Optimization (Conceptual)

Bayesian Optimization is a powerful, model-based technique for finding the maximum or minimum of "black-box" functions that are expensive to evaluate. HPO is a perfect use case.

  • The Key Idea: Instead of blindly searching the hyperparameter space, Bayesian Optimization builds a probabilistic model of the relationship between hyperparameters and the model's performance. It then uses this model to intelligently select the most promising hyperparameters to evaluate next.

  • Two Core Components:

    1. Surrogate Model (Probabilistic Model): This is a cheap-to-evaluate approximation of the expensive objective function (model performance).

      • The most common choice is a Gaussian Process (GP).
      • A GP models the objective function by defining a probability distribution over possible functions.
      • Crucially, for any given hyperparameter set, the GP provides a mean prediction of the performance and an uncertainty estimate. High uncertainty exists in regions of the space that have not yet been explored.
    2. Acquisition Function: This function uses the predictions and uncertainty from the surrogate model to decide which point to evaluate next. It balances two competing goals:

      • Exploitation: Choosing points where the surrogate model predicts high performance (i.e., low loss). This focuses on refining the search in known good areas.
      • Exploration: Choosing points where the uncertainty is high. This probes unknown regions of the space where an even better performance might be hiding.
      • A popular acquisition function is Expected Improvement (EI), which quantifies how much better we can expect a point to be compared to the best point found so far.
  • Algorithm Flow:

    1. Select and evaluate a few initial hyperparameter sets (e.g., randomly).
    2. Fit the Gaussian Process surrogate model to the evaluated points.
    3. Optimize the acquisition function (which is cheap to do) to find the next most promising hyperparameter set λ_next.
    4. Evaluate the true objective function at λ_next (i.e., train the ML model and get its validation score).
    5. Update the surrogate model with the new (λ_next, performance) data point.
    6. Repeat steps 3-5 for a set number of iterations (the budget).
  • Pros:

    • Sample Efficiency: Finds high-performing hyperparameter sets in far fewer evaluations than Grid or Random Search. This is its main advantage.
    • Excellent for optimizing models that are very expensive to train (e.g., deep neural networks).
  • Cons:

    • The overhead of fitting and optimizing the surrogate model can be significant.
    • Performance can degrade in very high-dimensional spaces (>20 dimensions).
    • Can be conceptually more complex to implement and understand.
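The loop above can be made concrete on a 1-D toy objective. This is an illustrative sketch (not a particular BO library's API), using scikit-learn's GaussianProcessRegressor as the surrogate and Expected Improvement as the acquisition function, with candidates restricted to a grid so the acquisition can be maximized by simple enumeration:

```python
# Toy Bayesian Optimization: GP surrogate + Expected Improvement, minimizing a 1-D loss
import numpy as np
from scipy.stats import norm
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

def objective(lam):                # the "expensive black box" (a cheap stand-in here)
    return (lam - 0.3) ** 2        # true optimum at lambda* = 0.3

candidates = np.linspace(0, 1, 101).reshape(-1, 1)

# 1. Evaluate a few initial points
X_obs = np.array([[0.0], [0.5], [1.0]])
y_obs = np.array([objective(x[0]) for x in X_obs])

for _ in range(5):
    # 2. Fit the GP surrogate to the evaluated points
    gp = GaussianProcessRegressor(kernel=RBF(length_scale=0.2), alpha=1e-6)
    gp.fit(X_obs, y_obs)
    mu, sigma = gp.predict(candidates, return_std=True)

    # 3. Expected Improvement over the best (lowest) loss observed so far
    best = y_obs.min()
    sigma = np.maximum(sigma, 1e-9)
    z = (best - mu) / sigma
    ei = (best - mu) * norm.cdf(z) + sigma * norm.pdf(z)

    # 4-5. Evaluate the most promising point and update the surrogate's data
    lam_next = candidates[np.argmax(ei)]
    X_obs = np.vstack([X_obs, lam_next.reshape(1, 1)])
    y_obs = np.append(y_obs, objective(lam_next[0]))

print(X_obs[np.argmin(y_obs)][0], y_obs.min())
```

Note how the search concentrates near the minimum after only a handful of evaluations, which is the sample-efficiency advantage described above.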

4. Optimization for Ensemble Learning

Ensemble learning combines multiple models to achieve better predictive performance. Optimization plays a role both within the individual models and in how they are combined.

4.1. Optimizing Base Learners

  • Ensemble methods like Random Forest, Gradient Boosting (XGBoost, LightGBM), and AdaBoost are built from many "base learners" (e.g., decision trees).
  • These base learners, and the ensemble method itself, have hyperparameters that can be tuned. For example, in a Gradient Boosting model, one might tune n_estimators, max_depth, learning_rate, and subsample.
  • Any HPO technique (Random Search, Bayesian Optimization, etc.) can be applied to find the optimal configuration for the entire ensemble. This can be very computationally expensive as each evaluation requires training the full ensemble.
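A sketch of applying Random Search to an ensemble's hyperparameters, with GradientBoostingClassifier as the example and illustrative distributions (the ranges are assumptions, not recommended defaults):

```python
# Random Search over a Gradient Boosting ensemble's hyperparameters
from scipy.stats import randint, uniform
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import RandomizedSearchCV

X, y = make_classification(n_samples=200, random_state=0)

param_dist = {
    'n_estimators': randint(50, 200),
    'max_depth': randint(2, 6),
    'learning_rate': uniform(0.01, 0.3),
    'subsample': uniform(0.5, 0.5),   # uniform on [0.5, 1.0]
}

# Each of the 5 trials trains a full ensemble once per CV fold
search = RandomizedSearchCV(GradientBoostingClassifier(random_state=0),
                            param_dist, n_iter=5, cv=3, random_state=0)
search.fit(X, y)
print(search.best_params_)
```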

4.2. Optimizing the Ensemble Combination: Stacking

Stacking (or Stacked Generalization) is an advanced ensemble technique where optimization is central to how the models are combined.

  • Concept: Instead of using a simple function (like averaging) to combine predictions from base models, stacking uses another machine learning model—a meta-learner—to learn the best way to combine them.

  • The Optimization Problem in Stacking: The meta-learner solves the optimization problem of finding the optimal weights or function to combine the base model predictions to minimize the overall error.

  • Process:

    1. Split Data: The training data is split into K-folds (e.g., 5 folds).
    2. Train Level-0 Models:
      • For each fold k from 1 to K:
        • Train a set of diverse base models (e.g., a Random Forest, an SVM, a k-NN) on the other K-1 folds.
        • Make predictions on the hold-out fold k.
      • After iterating through all K folds, you will have "out-of-fold" (OOF) predictions for the entire training dataset. These OOF predictions are crucial to prevent data leakage.
    3. Create Meta-Dataset: The OOF predictions from the base models become the input features for the meta-learner. The original target variable remains the target.
      • Meta-features = [RF_preds, SVM_preds, kNN_preds]
      • Meta-target = original_target
    4. Train Level-1 Model (Meta-Learner): Train the meta-learner (e.g., a simple Logistic Regression or a powerful Gradient Boosting model) on this new meta-dataset.
    5. Final Prediction: To predict on new test data, first get predictions from all the base models (which are retrained on the full original training set). Then, feed these predictions as input to the trained meta-learner to get the final output.
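The process above can be sketched with cross_val_predict generating the out-of-fold meta-features (scikit-learn also packages the whole pattern as StackingClassifier). The model choices here are illustrative:

```python
# Stacking sketch: OOF base-model predictions feed a logistic-regression meta-learner
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

X, y = make_classification(n_samples=300, random_state=0)

base_models = [RandomForestClassifier(random_state=0),
               SVC(probability=True, random_state=0),
               KNeighborsClassifier()]

# Steps 1-3: K-fold OOF predicted probabilities become the meta-features.
# No leakage: each row's prediction comes from a model that never saw that row.
meta_features = np.column_stack([
    cross_val_predict(m, X, y, cv=5, method="predict_proba")[:, 1]
    for m in base_models
])

# Step 4: train the level-1 meta-learner on the meta-dataset
meta_learner = LogisticRegression().fit(meta_features, y)

# Step 5: base models are retrained on the full training set for inference
for m in base_models:
    m.fit(X, y)
print(meta_learner.coef_)   # learned combination weights for the base models
```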

5. Introduction to Automated Machine Learning (AutoML)

AutoML aims to automate the complex, iterative, and time-consuming tasks in a machine learning pipeline, making ML technology more accessible and efficient.

5.1. What is AutoML?

  • Definition: AutoML is the process of automating the end-to-end pipeline of applying machine learning to real-world problems. It automates everything from data preparation to model deployment.
  • Goals:
    • Democratization: Enable domain experts who are not ML specialists to use ML models.
    • Efficiency: Automate repetitive tasks to allow ML experts to focus on more complex problems.
    • Performance: Systematically explore a vast space of models and hyperparameters to often achieve better performance than manually designed models.

5.2. The AutoML Pipeline

An ideal AutoML system automates the following stages:

  1. Automated Data Preparation: Handling missing values, outlier detection, data type conversion.
  2. Automated Feature Engineering: Automatically creating and selecting features. This includes scaling numerical features, encoding categorical features, and generating interaction or polynomial features.
  3. Model Selection (Algorithm Selection): Choosing the best algorithm (e.g., Logistic Regression, XGBoost, SVM, Neural Network) for the given dataset and task.
  4. Hyperparameter Optimization: Tuning the hyperparameters of the selected algorithm and preprocessing steps.

5.3. CASH: Combined Algorithm Selection and Hyperparameter Optimization

This is the core challenge that many modern AutoML systems aim to solve.

  • Problem: The search space is not just the hyperparameters of a single model but a complex, hierarchical space that includes:

    • The choice of data preprocessing algorithm(s).
    • The hyperparameters of the chosen preprocessor(s).
    • The choice of machine learning model.
    • The hyperparameters of the chosen model.
  • Solution: Techniques like Bayesian Optimization and Evolutionary Algorithms are extended to handle this complex, conditional search space, allowing the system to simultaneously find the best pipeline and its corresponding configuration.
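The hierarchical, conditional nature of the CASH space can be illustrated with a naive random search over joint pipelines: each sample picks a preprocessor, a model, and that model's hyperparameters, where the available hyperparameters depend on which model was chosen. All components here are illustrative choices:

```python
# Toy random search over a conditional CASH-style search space
import random
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler, StandardScaler

rng = random.Random(0)
X, y = make_classification(n_samples=200, random_state=0)

def sample_pipeline():
    scaler = rng.choice([StandardScaler(), MinMaxScaler()])
    # The model choice determines which hyperparameters even exist (conditionality)
    if rng.random() < 0.5:
        model = LogisticRegression(C=rng.choice([0.1, 1.0, 10.0]))
    else:
        model = RandomForestClassifier(n_estimators=rng.choice([50, 100]),
                                       max_depth=rng.choice([3, 5, None]),
                                       random_state=0)
    return make_pipeline(scaler, model)

best_pipe, best_score = None, -1.0
for _ in range(8):   # random search over the joint pipeline space
    pipe = sample_pipeline()
    score = cross_val_score(pipe, X, y, cv=3).mean()
    if score > best_score:
        best_pipe, best_score = pipe, score

print(best_pipe, best_score)
```

Real AutoML systems replace this blind sampling with the model-based searches from Section 3.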

5.4. Popular AutoML Frameworks

  • Auto-sklearn: Built on scikit-learn, it uses Bayesian Optimization to solve the CASH problem.
  • TPOT (Tree-based Pipeline Optimization Tool): Uses Genetic Programming (a type of evolutionary algorithm) to evolve entire ML pipelines as expression trees.
  • H2O AutoML: An open-source, distributed, and scalable platform that automates training and tuning for a wide range of models.
  • Google Cloud AutoML / Azure Automated ML: Commercial cloud services that provide user-friendly interfaces for building high-quality custom models.