Unit 6 - Notes
Unit 6: Hybrid Optimization and Applications
1. Hybrid Optimization Models
1.1. Introduction and Motivation
Hybrid optimization involves combining two or more optimization algorithms to create a new, more powerful method. The primary motivation stems from the No Free Lunch (NFL) Theorem for optimization, which states that no single algorithm is universally best for all types of problems.
Core Idea: To leverage the strengths of different algorithms while mitigating their respective weaknesses. A common strategy is to combine a global search algorithm (strong in exploration) with a local search algorithm (strong in exploitation).
- Exploration: The ability to search broadly across the entire feasible region to find promising areas and avoid getting stuck in local optima. (e.g., Genetic Algorithms, Particle Swarm Optimization).
- Exploitation: The ability to finely search within a promising region to find the precise local optimum. (e.g., Hill Climbing, Gradient Descent, Nelder-Mead).
1.2. Taxonomy of Hybrid Models
Hybridization can be categorized based on how the constituent algorithms are combined.
A. Based on the Level of Integration:
- High-Level Hybridization (Relay or Teamwork):
- The constituent algorithms are self-contained and act as black boxes. They exchange information, but their internal workings are not mixed.
- Sequential (Relay): One algorithm runs to completion, and its output is used as the input for the next.
- Example: Use a Genetic Algorithm (GA) to find a good initial solution, then use Gradient Descent to refine it to the nearest local optimum.
- Parallel (Teamwork): Algorithms run concurrently and share information during their execution.
- Example: Running multiple optimization algorithms in parallel and periodically sharing the best-found solutions among them.
- Low-Level Hybridization (Embedded):
- One algorithm's functionality is deeply integrated into the inner workings of another. A component of one algorithm is replaced by another complete algorithm or its operators.
- Example: In a GA, replacing the standard mutation operator with a local search heuristic. An individual is mutated by running a few steps of Hill Climbing on it. This type of hybrid is often called a Memetic Algorithm (MA).
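As a concrete illustration of the sequential (relay) pattern, the sketch below seeds a simple local search with the best point found by random global sampling. The test function, sampling budget, and step-halving local search are illustrative stand-ins for a real GA and gradient descent:

```python
import random

def f(x):
    # Illustrative test function with two local minima;
    # the global one is near x ≈ -1.30
    return x**4 - 3 * x**2 + x

def global_phase(n_samples=200, lo=-3.0, hi=3.0):
    # Relay stage 1: broad random exploration of the feasible region
    return min((random.uniform(lo, hi) for _ in range(n_samples)), key=f)

def local_phase(x, step=0.5, tol=1e-8):
    # Relay stage 2: refine the seed with a simple step-halving descent
    while step > tol:
        for candidate in (x - step, x + step):
            if f(candidate) < f(x):
                x = candidate
                break
        else:
            step /= 2  # no improvement at this scale: zoom in
    return x

random.seed(0)
seed = global_phase()      # output of the first algorithm...
refined = local_phase(seed)  # ...becomes the input of the second
```

The two stages stay black boxes to each other: only the best-found point crosses the boundary, which is exactly the high-level relay pattern.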
B. Based on the Type of Algorithms Combined:
- Homogeneous Hybridization: Combining different variants of the same base algorithm.
- Example: An "island model" GA where different sub-populations use different crossover rates or selection mechanisms.
- Heterogeneous Hybridization: Combining fundamentally different algorithmic approaches.
- Example: Combining a swarm-based method (PSO) with an evolutionary method (GA), or a metaheuristic with a deterministic mathematical programming technique.
1.3. Common Hybridization Strategies
- Global Exploration + Local Exploitation (Memetic Algorithms):
- Concept: Use a population-based global optimizer (like GA, DE, PSO) to identify promising regions of the search space. Then, apply a local search method to each individual (or the best ones) to quickly find the local optimum in that region.
- Benefit: Achieves a better balance between exploration and exploitation, often leading to faster convergence to high-quality solutions.
```python
# Pseudocode for a simple Memetic Algorithm
population = initialize_population()
for generation in range(max_generations):
    # Apply local search to some/all individuals
    population = [local_search_heuristic(ind) for ind in population]
    # Evaluate fitness of the improved population
    evaluate_fitness(population)
    # Standard evolutionary steps
    parents = select_parents(population)
    offspring = crossover(parents)
    offspring = mutate(offspring)
    population = replace_population(population, offspring)
best = best_individual_in(population)
```
- Seeding/Initialization: Use a simpler, faster heuristic to generate a high-quality initial population for a more complex metaheuristic. This gives the main algorithm a better starting point.
- Parameter Control: Use one optimization algorithm to tune the hyperparameters of another. For example, using a metaheuristic to find the optimal learning rate and momentum for a gradient-based optimizer in a neural network.
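A minimal sketch of the parameter-control idea, using random search as a stand-in for the outer metaheuristic; the inner quadratic objective and the `gd_final_loss` helper are hypothetical:

```python
import random

def gd_final_loss(lr, steps=50):
    # Inner optimizer: gradient descent on f(x) = (x - 3)^2,
    # whose gradient is 2 * (x - 3)
    x = 0.0
    for _ in range(steps):
        x -= lr * 2 * (x - 3)
    return (x - 3) ** 2

random.seed(1)
# Outer optimizer: random search over the learning rate
# (a stand-in for a full metaheuristic)
candidates = [random.uniform(0.001, 1.0) for _ in range(30)]
best_lr = min(candidates, key=gd_final_loss)
```

The outer search treats the whole inner optimization run as a single expensive objective evaluation, which is the defining feature of this strategy.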
2. Combining Evolutionary and Swarm-Based Approaches
This is a popular form of heterogeneous hybridization, aiming to combine the powerful recombination operators of Evolutionary Algorithms (EAs) with the rapid information-sharing mechanisms of Swarm Intelligence (SI).
2.1. Genetic Algorithm + Particle Swarm Optimization (GA-PSO)
- Motivation: GA excels at maintaining population diversity through crossover and mutation. PSO excels at fast convergence by having particles learn from both their own experience (pbest) and the swarm's collective experience (gbest). The hybrid aims to prevent PSO's premature convergence while accelerating GA's search.
- Hybridization Strategies:
- Operator-based Hybrid: Introduce GA operators into the PSO algorithm. After the standard PSO velocity and position updates, apply crossover and mutation to a subset of the particles. This re-introduces diversity and can help the swarm escape local optima.
- Sequential Hybrid (GA for Initialization): Run a GA for a few generations to generate a diverse, high-fitness population. Use this population as the initial swarm for PSO. This prevents the initial random PSO swarm from collapsing to a poor local optimum too quickly.
- Co-evolutionary Hybrid: Maintain two populations, one evolving with GA rules and the other with PSO rules. Periodically, individuals can migrate between the populations, allowing for the exchange of beneficial genetic material and swarm intelligence.
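The operator-based strategy can be sketched as follows; the Rastrigin test function, coefficient values, and Gaussian mutation scheme are illustrative choices, not a canonical GA-PSO:

```python
import math
import random

def rastrigin(x):
    # Multimodal test function; global minimum 0 at the origin
    return sum(xi * xi - 10 * math.cos(2 * math.pi * xi) + 10 for xi in x)

def ga_pso(dim=2, n=30, iters=200, w=0.7, c1=1.5, c2=1.5, mut_rate=0.1):
    random.seed(42)
    pos = [[random.uniform(-5.12, 5.12) for _ in range(dim)] for _ in range(n)]
    vel = [[0.0] * dim for _ in range(n)]
    pbest = [p[:] for p in pos]
    pbest_f = [rastrigin(p) for p in pos]
    gbest = min(pbest, key=rastrigin)[:]
    for _ in range(iters):
        for i in range(n):
            # Standard PSO velocity and position update
            for d in range(dim):
                r1, r2 = random.random(), random.random()
                vel[i][d] = (w * vel[i][d]
                             + c1 * r1 * (pbest[i][d] - pos[i][d])
                             + c2 * r2 * (gbest[d] - pos[i][d]))
                pos[i][d] += vel[i][d]
            # GA-style mutation on a fraction of particles
            # to re-introduce diversity after the swarm update
            if random.random() < mut_rate:
                d = random.randrange(dim)
                pos[i][d] += random.gauss(0.0, 0.5)
            f = rastrigin(pos[i])
            if f < pbest_f[i]:
                pbest[i], pbest_f[i] = pos[i][:], f
                if f < rastrigin(gbest):
                    gbest = pos[i][:]
    return gbest, rastrigin(gbest)
```

The only change relative to plain PSO is the mutation step inside the particle loop, which is what makes this a low-level, operator-based hybrid.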
2.2. Differential Evolution + Ant Colony Optimization (DE-ACO)
- Context: Useful for mixed (continuous and discrete) optimization problems. DE is powerful for continuous problems, while ACO is designed for discrete/combinatorial problems.
- Application Example: Optimizing a system with both continuous parameters and a discrete structural component (e.g., designing a communication network where node positions are continuous, but the connection topology is discrete).
- Hybridization Strategy (High-Level, Co-evolutionary):
- The DE algorithm optimizes the continuous parameters of the problem.
- The ACO algorithm optimizes the discrete/combinatorial part.
- To evaluate a DE individual (a set of continuous parameters), a full ACO run is performed using those parameters to find the best corresponding discrete structure. The result of the ACO run becomes the fitness of the DE individual.
- Conversely, the pheromone trails in ACO can be biased by the quality of solutions found by the DE part.
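A toy sketch of the nested evaluation scheme, with exhaustive enumeration standing in for the inner ACO run and a (1+1) Gaussian search standing in for DE; the objective `mixed_cost` is invented for illustration:

```python
import random

def mixed_cost(c, d):
    # Toy mixed objective: continuous parameter c, discrete choice d
    return (c - 2.7) ** 2 + abs(d - round(c)) + 0.1 * d

def best_discrete(c):
    # Stand-in for the inner ACO run: find the best discrete
    # structure for the given continuous parameters
    return min(range(10), key=lambda d: mixed_cost(c, d))

def de_fitness(c):
    # A DE individual's fitness is the cost of the best
    # discrete structure found for it
    return mixed_cost(c, best_discrete(c))

random.seed(5)
# Stand-in for DE on the continuous part: simple (1+1) Gaussian search
c = random.uniform(0.0, 5.0)
for _ in range(200):
    trial = min(5.0, max(0.0, c + random.gauss(0.0, 0.3)))
    if de_fitness(trial) < de_fitness(c):
        c = trial
d = best_discrete(c)
```

Note the cost structure: every evaluation of a continuous individual triggers a full inner discrete search, which is why this hybrid is expensive but handles mixed variables cleanly.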
3. Real-world Case Studies in Optimization-based ML Systems
3.1. Case Study: Hyperparameter Tuning of Deep Neural Networks
- Problem: The performance of a Deep Neural Network (DNN) is highly sensitive to its hyperparameters (e.g., learning rate, number of layers, layer size, dropout rate, optimizer choice). This creates a vast, complex, and non-differentiable search space. Evaluating a single configuration is computationally expensive as it requires training the network.
- Hybrid Approach: Bayesian Optimization with Evolutionary Seeding.
- Phase 1 (Global Exploration): A Genetic Algorithm (GA) or another evolutionary strategy is run for a small number of generations. This quickly explores a wide range of diverse hyperparameter combinations (handling discrete, categorical, and continuous values naturally).
- Phase 2 (Informed Exploitation): The top-performing individuals from the GA are used as the initial points for Bayesian Optimization (BO). BO builds a probabilistic surrogate model (e.g., a Gaussian Process) of the hyperparameter-performance landscape. It then uses an acquisition function (e.g., Expected Improvement) to intelligently select the next hyperparameter set to evaluate, focusing on regions that are likely to yield improvements.
- Why it's effective: The GA provides a "warm start" for BO, preventing it from wasting time exploring poor regions of the search space. BO then efficiently refines the search in the promising areas identified by the GA.
3.2. Case Study: Feature Selection in Medical Diagnosis
- Problem: In genomics or medical imaging, datasets can have tens of thousands of features (genes, pixels). Selecting a minimal subset of features that retains maximum predictive power is crucial for building accurate, interpretable, and computationally efficient diagnostic models. This is an NP-hard combinatorial problem.
- Hybrid Approach: Binary PSO with a Local Search Wrapper (Memetic Approach).
- PSO Part (Global Search): A binary version of Particle Swarm Optimization is used. Each particle is a binary vector where 1 indicates a selected feature and 0 a non-selected feature. The swarm explores the space of all possible feature subsets. The fitness function rewards high classification accuracy (e.g., on a validation set) and penalizes a large number of selected features to encourage sparsity.
- Local Search Part (Fine-tuning): For the best particle in each generation, a simple greedy heuristic like Sequential Backward Selection (SBS) is applied for a few steps. SBS iteratively removes features from the current subset, one at a time, keeping the removal only if it does not significantly degrade classifier performance.
- Why it's effective: PSO efficiently explores the global landscape of feature combinations. The SBS wrapper performs local exploitation, refining the best-found subsets by pruning redundant or irrelevant features that the coarse-grained PSO might have missed.
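The SBS refinement step can be sketched with a hypothetical `score` function standing in for cross-validated classifier accuracy; here features 0 and 2 are the informative ones by construction:

```python
def score(subset):
    # Hypothetical evaluation: in a real wrapper this would be
    # cross-validated classifier accuracy on the selected features
    relevant = {0, 2}
    hit = len(relevant & subset) / len(relevant)
    return hit - 0.01 * len(subset)  # reward relevance, penalize size

def sbs(subset, min_size=1, tol=0.0):
    # Sequential Backward Selection: greedily drop features while the
    # score does not degrade by more than `tol`
    subset = set(subset)
    improved = True
    while improved and len(subset) > min_size:
        improved = False
        base = score(subset)
        for f in sorted(subset):
            trial = subset - {f}
            if score(trial) >= base - tol:
                subset = trial
                improved = True
                break
    return subset

selected = sbs({0, 1, 2, 3, 4})
```

In the memetic hybrid, this routine would be applied only to the best particle of each PSO generation, keeping the extra evaluation cost bounded.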
4. Performance Evaluation of Optimization Techniques
Evaluating stochastic optimization algorithms requires a rigorous and standardized methodology to ensure that conclusions are statistically significant and not due to random chance.
4.1. Key Metrics
- Solution Quality (Accuracy):
- For benchmark problems with known optima, the error (difference between found solution and true optimum) is used.
- For real-world problems, the best objective function value found is the primary metric.
- Statistical Measures: Over multiple runs (e.g., 30+), report the Best, Worst, Mean, Median, and Standard Deviation of the final solutions. A small standard deviation indicates high robustness.
- Convergence Speed:
- Number of Function Evaluations (NFE): The most common and hardware-independent measure of computational effort.
- Convergence Plots: A graph plotting the mean best-so-far fitness against the NFE. This visualizes both the speed and quality of the search process.
- Success Rate (SR):
- The percentage of runs in which an algorithm successfully finds a solution that meets a predefined quality threshold (e.g., within 1e-6 of the known global optimum).
- Computational Time:
- Average CPU time per run. While hardware-dependent, it's a practical measure of the algorithm's real-world cost.
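The statistical measures above can be computed directly from the final values of repeated runs; the sample run results here are invented:

```python
import statistics

def summarize_runs(final_values, target, threshold=1e-6):
    # Standard reporting over repeated independent runs
    # of a stochastic optimizer
    return {
        "best": min(final_values),
        "worst": max(final_values),
        "mean": statistics.mean(final_values),
        "median": statistics.median(final_values),
        "std": statistics.stdev(final_values),
        # Success Rate: fraction of runs within `threshold` of the optimum
        "success_rate": sum(abs(v - target) <= threshold
                            for v in final_values) / len(final_values),
    }

runs = [0.0, 1e-8, 2e-7, 0.5, 3e-9]  # e.g., final errors from 5 runs
stats = summarize_runs(runs, target=0.0)
```

Reporting the whole dictionary (rather than the mean alone) is what exposes outlier runs like the 0.5 above, which the mean would hide.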
4.2. Methodology
- Benchmark Suite: Use a diverse set of standard benchmark functions (e.g., Sphere, Rastrigin, Ackley) that test different aspects of an algorithm's performance (unimodality, multi-modality, separability, scalability).
- Multiple Runs: Always perform multiple independent runs (typically 30 or more) for each algorithm on each problem to account for stochasticity.
- Statistical Significance Testing:
- Performance differences observed in mean or median values might be due to chance. Statistical tests are crucial.
- Because the distribution of results from metaheuristics is often not normal, non-parametric tests are preferred.
- For comparing two algorithms: Use the Wilcoxon rank-sum test.
- For comparing three or more algorithms: Use the Friedman test followed by post-hoc tests (like Nemenyi or Bonferroni-Dunn) to identify which pairs of algorithms have significantly different performance.
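A sketch of the Wilcoxon rank-sum test via the normal approximation; this is a simplified version (average ranks for ties, no continuity or tie-variance correction), and a library such as scipy.stats would be used in practice:

```python
from statistics import NormalDist

def rank_sum_test(a, b):
    # Wilcoxon rank-sum (Mann-Whitney) test, normal approximation
    combined = sorted(a + b)
    # assign 1-based average ranks (ties get the mean of their ranks)
    ranks = {}
    i = 0
    while i < len(combined):
        j = i
        while j + 1 < len(combined) and combined[j + 1] == combined[i]:
            j += 1
        ranks[combined[i]] = (i + j + 2) / 2
        i = j + 1
    n1, n2 = len(a), len(b)
    w = sum(ranks[v] for v in a)          # rank sum of sample a
    mean = n1 * (n1 + n2 + 1) / 2
    sd = (n1 * n2 * (n1 + n2 + 1) / 12) ** 0.5
    z = (w - mean) / sd
    p = 2 * (1 - NormalDist().cdf(abs(z)))  # two-sided p-value
    return z, p
```

Fed with the 30+ final values per algorithm described above, a p-value below the chosen significance level (commonly 0.05) indicates the performance difference is unlikely to be due to chance.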
5. Computational Considerations in Large-Scale Optimization
Applying optimization techniques to large-scale ML problems (e.g., big data, high-dimensional models) introduces significant computational challenges.
5.1. Core Challenges
- Curse of Dimensionality: As the number of variables (dimensions) increases, the search space volume grows exponentially. A fixed number of samples becomes increasingly sparse, making it extremely difficult for any algorithm to explore the space effectively.
- Expensive Objective Function Evaluations: In ML, evaluating the objective function often means training and validating a model, which can take hours or days for large datasets and complex architectures (e.g., large language models). This makes traditional population-based methods that require thousands of evaluations infeasible.
- Massive Datasets: The cost of a single function evaluation scales with the size of the training data. Algorithms that need to process the entire dataset for each evaluation are impractical.
5.2. Strategies and Techniques
- Parallel and Distributed Computing:
- Master-Slave Parallelism: A central master process manages the algorithm's logic and distributes fitness evaluations to multiple worker nodes. This is straightforward to implement for population-based methods and dramatically reduces wall-clock time.
- Island Model (Distributed GA): The population is divided into sub-populations (islands) that evolve independently on different machines. Periodically, a few individuals "migrate" between islands. This not only parallelizes the workload but can also improve search performance by maintaining diversity.
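A minimal master-slave sketch using a thread pool; the `fitness` function is a stand-in, and truly CPU-bound evaluations would use processes or a cluster scheduler rather than threads:

```python
from concurrent.futures import ThreadPoolExecutor

def fitness(x):
    # Stand-in for an expensive evaluation (e.g., training a model)
    return sum(xi * xi for xi in x)

def evaluate_population(population, workers=4):
    # Master-slave pattern: the master farms fitness evaluations
    # out to a pool of workers and collects the results in order
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(fitness, population))

pop = [[1.0, 2.0], [0.0, 0.0], [3.0, -1.0]]
scores = evaluate_population(pop)
```

Because the algorithm's logic stays in the master, this parallelization is transparent to the optimizer: it sees the same fitness values, just computed faster in wall-clock time.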
- Surrogate-Assisted Optimization:
- Concept: When the true objective function is too expensive, build a cheap, approximate model of it called a surrogate model (or meta-model).
- Process:
- Evaluate a small number of initial points on the true, expensive function.
- Train a surrogate model (e.g., a Gaussian Process, Random Forest) on these results.
- Run the optimization algorithm for many iterations on the cheap surrogate to find a promising new candidate solution.
- Evaluate this candidate on the true function.
- Add the new result to the training set and update the surrogate model. Repeat.
- Benefit: Drastically reduces the number of expensive function evaluations required. This is the core idea behind Bayesian Optimization.
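The surrogate loop can be sketched in one dimension with a quadratic least-squares surrogate; the objective, budgets, and grid search over the surrogate are illustrative simplifications of a Gaussian Process with an acquisition function:

```python
import math

def expensive_f(x):
    # Stand-in for a costly objective (e.g., training + validating a model)
    return (x - 2.0) ** 2 + 0.1 * math.sin(5 * x)

def fit_quadratic(xs, ys):
    # Least-squares fit of y ≈ a*x^2 + b*x + c via the normal equations
    n = len(xs)
    s = lambda p: sum(x ** p for x in xs)
    sy = lambda p: sum((x ** p) * y for x, y in zip(xs, ys))
    A = [[s(4), s(3), s(2), sy(2)],
         [s(3), s(2), s(1), sy(1)],
         [s(2), s(1), n,    sy(0)]]
    # Gaussian elimination with partial pivoting on the 3x3 system
    for col in range(3):
        piv = max(range(col, 3), key=lambda r: abs(A[r][col]))
        A[col], A[piv] = A[piv], A[col]
        for r in range(col + 1, 3):
            fac = A[r][col] / A[col][col]
            for cc in range(col, 4):
                A[r][cc] -= fac * A[col][cc]
    coef = [0.0, 0.0, 0.0]
    for r in (2, 1, 0):
        coef[r] = (A[r][3] - sum(A[r][c] * coef[c]
                                 for c in range(r + 1, 3))) / A[r][r]
    return coef  # a, b, c

def surrogate_optimize(lo=-5.0, hi=5.0, n_init=4, rounds=6):
    # Step 1: a few true (expensive) evaluations
    xs = [lo + i * (hi - lo) / (n_init - 1) for i in range(n_init)]
    ys = [expensive_f(x) for x in xs]
    for _ in range(rounds):
        # Step 2: train the cheap surrogate on all results so far
        a, b, c = fit_quadratic(xs, ys)
        # Step 3: optimize the surrogate (dense grid search here)
        grid = [lo + i * (hi - lo) / 1000 for i in range(1001)]
        cand = min(grid, key=lambda x: a * x * x + b * x + c)
        # Steps 4-5: one true evaluation, then update the surrogate data
        xs.append(cand)
        ys.append(expensive_f(cand))
    return min(zip(xs, ys), key=lambda t: t[1])
```

The whole run costs only `n_init + rounds` true evaluations (10 here), while the surrogate absorbs the thousands of cheap evaluations done by the grid search.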
- Decomposition (Cooperative Co-evolution):
- Concept: For high-dimensional problems, "divide and conquer": the D-dimensional vector of variables is split into k smaller sub-vectors.
- Process: A separate sub-population is used to optimize each sub-vector. To evaluate an individual from one sub-population, it is combined with representative individuals from all other sub-populations to form a complete solution.
- Benefit: Effective for problems where variables are loosely coupled (nearly separable). It transforms a single hard, high-dimensional problem into several easier, low-dimensional ones.
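A minimal cooperative co-evolution sketch on the (separable) sphere function; the (1+1) mutation scheme and all parameter values are illustrative:

```python
import random

def sphere(x):
    # Separable test function; global minimum 0 at the origin
    return sum(xi * xi for xi in x)

def cooperative_coevolution(dim=4, groups=2, iters=300, sigma=0.3):
    random.seed(7)
    size = dim // groups
    # context vector: the current complete solution shared by all groups
    context = [random.uniform(-5, 5) for _ in range(dim)]
    for _ in range(iters):
        for g in range(groups):
            # mutate only this group's block of variables
            trial = context[:]
            for d in range(g * size, (g + 1) * size):
                trial[d] += random.gauss(0.0, sigma)
            # accept the block only if the complete solution improves
            if sphere(trial) < sphere(context):
                context = trial
    return context, sphere(context)
```

Each group only ever optimizes its own block, yet every fitness evaluation uses the full context vector, which is the defining mechanism of cooperative co-evolution.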
- Stochastic and Mini-batch Methods:
- Context: The dominant approach for training large-scale deep learning models.
- Concept: Instead of calculating the true gradient on the entire dataset (which is prohibitively expensive), Stochastic Gradient Descent (SGD) and its variants (Adam, RMSprop) estimate the gradient using a small, random subset of the data called a mini-batch.
- Benefit: Each step is computationally cheap and, while noisy, provides a statistically unbiased estimate of the true gradient. This allows the model to make rapid progress, converging much faster in practice for large, redundant datasets.
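A self-contained sketch of mini-batch SGD fitting a one-variable linear model to synthetic, noise-free data; the learning rate, batch size, and epoch count are illustrative:

```python
import random

def minibatch_sgd(data, lr=0.1, batch_size=8, epochs=300):
    # Fit y ≈ w*x + b: each step estimates the full-dataset gradient
    # of the mean squared error from one small random mini-batch
    w, b = 0.0, 0.0
    for _ in range(epochs):
        random.shuffle(data)
        for i in range(0, len(data), batch_size):
            batch = data[i:i + batch_size]
            gw = sum(2 * (w * x + b - y) * x for x, y in batch) / len(batch)
            gb = sum(2 * (w * x + b - y) for x, y in batch) / len(batch)
            w -= lr * gw
            b -= lr * gb
    return w, b

random.seed(3)
data = [(x / 50, 2 * (x / 50) + 1) for x in range(-50, 50)]  # y = 2x + 1
w, b = minibatch_sgd(data)
```

Each gradient estimate is noisy but unbiased, so the parameters drift toward the true values (w = 2, b = 1) at a fraction of the per-step cost of full-batch gradient descent.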