Unit 2 - Notes

CSE275

Unit 2: Evolutionary Algorithms

1. Introduction to Evolutionary Computation (EC)

Evolutionary Computation is a subfield of artificial intelligence and computational intelligence concerned with combinatorial and numerical optimization problems. EC algorithms make iterative progress by evolving a population of candidate solutions, in a process inspired by biological evolution.

Core Principles:
EC algorithms are population-based metaheuristic optimization algorithms. The core idea is based on Darwin's theory of evolution and natural selection: "survival of the fittest".

  1. Population: A set of potential solutions (called individuals or chromosomes) to the problem is maintained.
  2. Fitness: Each individual is evaluated based on a fitness function, which quantifies how good a solution it is.
  3. Selection: Individuals with higher fitness are more likely to be selected to "reproduce" and pass their traits to the next generation.
  4. Reproduction (Variation): New individuals (offspring) are created by combining or modifying existing individuals (parents) using operators like crossover and mutation.
  5. Iteration: This process of evaluation, selection, and reproduction is repeated over many generations, with the expectation that the overall fitness of the population will improve over time, converging towards an optimal solution.

Key Characteristics of EC:

  • Stochastic: They incorporate random elements in initialization, selection, and variation operators.
  • Population-Based: They work with a pool of solutions, making them less likely to get stuck in local optima compared to single-point search methods (like hill climbing).
  • Heuristic: They do not guarantee finding the absolute global optimum but are excellent for finding very good, near-optimal solutions for complex problems.
  • Problem-Agnostic: The core algorithm is separate from the problem-specific representation and fitness function, making them adaptable to a wide range of problems.

Major Paradigms of EC:

  • Genetic Algorithms (GAs): The most popular EC technique, typically using binary or string-based representations.
  • Genetic Programming (GP): Evolves computer programs, often represented as trees.
  • Evolution Strategies (ES): Focuses on real-valued vector optimization, with a strong emphasis on self-adapting mutation rates.
  • Evolutionary Programming (EP): Similar to ES but historically focused on evolving finite state machines.

2. Genetic Algorithms (GAs)

A Genetic Algorithm is a search heuristic that mimics the process of natural selection. It is commonly used to generate high-quality solutions to optimization and search problems.

Standard GA Workflow:

  1. Initialization: Create an initial population of N individuals, usually randomly. Each individual represents a potential solution.
  2. Evaluation: Calculate the fitness of each individual in the population using the fitness function.
  3. Selection: Select parent individuals from the current population based on their fitness. Fitter individuals have a higher chance of being selected.
  4. Crossover (Recombination): Create offspring by combining the genetic material of selected parents.
  5. Mutation: Apply small, random changes to the offspring's genetic material to introduce new information and maintain diversity.
  6. Replacement: Form a new generation by replacing some or all of the old population with the new offspring.
  7. Termination: Stop the algorithm if a termination condition is met (e.g., a maximum number of generations, a satisfactory fitness level is reached, or the population stops improving). If not, return to Step 2.

Pseudocode for a Standard Genetic Algorithm:

TEXT
function GeneticAlgorithm(problem_definition):
    // Step 1: Initialization
    population = initialize_population(population_size)

    for generation in 1 to max_generations:
        // Step 2: Evaluation
        fitness_scores = evaluate_fitness(population, problem_definition)

        // Termination Check
        if termination_condition_met(fitness_scores):
            break

        // Step 3: Selection
        parents = select_parents(population, fitness_scores)

        // Step 4 & 5: Crossover and Mutation
        offspring_population = []
        while len(offspring_population) < population_size:
            parent1, parent2 = choose_from(parents)
            if random() < crossover_rate:
                child = crossover(parent1, parent2)
            else:
                child = copy(parent1) // Asexual reproduction

            if random() < mutation_rate:
                child = mutate(child)
            
            add child to offspring_population

        // Step 6: Replacement
        population = offspring_population

    // Re-evaluate so the scores match the final population
    fitness_scores = evaluate_fitness(population, problem_definition)
    best_solution = find_best_individual(population, fitness_scores)
    return best_solution
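The pseudocode above can be sketched as a runnable Python program. This is a minimal illustration, assuming a OneMax objective (maximize the number of 1-bits), tournament selection with k = 2, single-point crossover, and bit-flip mutation; the termination check is reduced to a fixed generation budget.

```python
import random

def genetic_algorithm(fitness, chrom_len, pop_size=50, max_generations=100,
                      crossover_rate=0.8, mutation_rate=0.01):
    # Step 1: random initial bit-string population
    population = [[random.randint(0, 1) for _ in range(chrom_len)]
                  for _ in range(pop_size)]
    for _ in range(max_generations):
        # Step 2: evaluate every individual
        scores = [fitness(ind) for ind in population]
        offspring = []
        while len(offspring) < pop_size:
            # Step 3: tournament selection with k = 2
            i, j = random.sample(range(pop_size), 2)
            p1 = population[i] if scores[i] >= scores[j] else population[j]
            i, j = random.sample(range(pop_size), 2)
            p2 = population[i] if scores[i] >= scores[j] else population[j]
            # Step 4: single-point crossover with probability crossover_rate
            if random.random() < crossover_rate:
                point = random.randint(1, chrom_len - 1)
                child = p1[:point] + p2[point:]
            else:
                child = p1[:]  # clone the parent
            # Step 5: bit-flip mutation, gene by gene
            child = [1 - g if random.random() < mutation_rate else g
                     for g in child]
            offspring.append(child)
        # Step 6: generational replacement
        population = offspring
    return max(population, key=fitness)

# OneMax: fitness = number of 1-bits; the optimum is the all-ones string
best = genetic_algorithm(fitness=sum, chrom_len=20)
```

For OneMax the built-in sum serves as the fitness function, since it counts the 1-bits directly.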

2.1 Representation (Encoding)

The first step in applying a GA is to represent a solution as a chromosome. The choice of representation is crucial and problem-dependent.

  • Binary Representation:

    • The most traditional representation. Each chromosome is a string of bits (0s and 1s).
    • Example: For the knapsack problem, a binary string [1, 0, 1, 1, 0] could mean "include item 1, exclude item 2, include item 3, include item 4, exclude item 5".
    • Pros: Simple, allows for standard crossover/mutation operators.
    • Cons: Can be unnatural for some problems (e.g., real-valued parameters), and Hamming cliffs can be an issue: adjacent values such as 7 (0111) and 8 (1000) differ in every bit, so a small numeric step requires many simultaneous flips.
  • Real-Valued (or Floating-Point) Representation:

    • Each gene is a real number. The chromosome is a vector of real values.
    • Example: For tuning hyperparameters of a neural network, a chromosome [0.001, 0.9, 128] could represent learning rate, momentum, and batch size.
    • Pros: Natural for continuous parameter optimization, more precise.
    • Cons: Requires different genetic operators (e.g., arithmetic crossover, Gaussian mutation).
  • Integer Representation:

    • Each gene is an integer. Used when parameters are discrete.
    • Example: A chromosome [3, 5, 1, 4, 2] could represent a specific configuration or choice from a set of options.
  • Permutation Representation:

    • Used for ordering or sequencing problems. The chromosome is a permutation of a set of elements.
    • Example: For the Traveling Salesperson Problem (TSP), a chromosome [C, A, D, B, E] represents the order in which cities are visited.
    • Pros: Directly encodes valid solutions for ordering problems.
    • Cons: Requires specialized operators (e.g., Order Crossover, Swap Mutation) to ensure offspring are also valid permutations.
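The four encodings above can be illustrated with plain Python lists; the problem sizes used here are arbitrary placeholders:

```python
import random

n_items, n_params, n_cities = 5, 3, 5

# Binary: inclusion flags, e.g. for the knapsack problem
binary_chrom = [random.randint(0, 1) for _ in range(n_items)]

# Real-valued: a vector of continuous parameters
real_chrom = [random.uniform(0.0, 1.0) for _ in range(n_params)]

# Integer: discrete choices from a fixed range of options
int_chrom = [random.randint(1, 5) for _ in range(n_params)]

# Permutation: a visiting order, e.g. a TSP tour over city indices
perm_chrom = random.sample(range(n_cities), n_cities)
```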

2.2 Fitness Function

The fitness function is the bridge between the GA and the problem to be solved. It takes an individual (a solution) as input and returns a single numerical score, its fitness. This score guides the selection process.

  • Role: It quantifies the quality of a solution. For an optimization problem, this is typically the value of the objective function.
  • Maximization vs. Minimization: GAs are naturally maximizers. To solve a minimization problem (e.g., minimizing error), you can:
    1. Invert the objective function: fitness = 1 / error (careful with division by zero).
    2. Subtract from a large constant: fitness = C_max - error.
  • Properties of a good fitness function:
    • Accurate: Must correctly measure the quality of the solution according to the problem's goals.
    • Efficient: It is called once for every individual in every generation, so it must be computationally fast. A slow fitness function will make the entire GA slow.
    • Discriminating: It should provide a gradient of quality, allowing the GA to distinguish between good and excellent solutions.
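The two minimization-to-maximization transforms above can be written as small helpers; the eps guard and the choice of C_max are illustrative assumptions:

```python
def fitness_inverse(error, eps=1e-9):
    # Invert the objective; eps guards against division by zero
    return 1.0 / (error + eps)

def fitness_offset(error, c_max=1000.0):
    # Subtract from a constant chosen larger than any expected error
    return c_max - error
```

Either way, lower error now maps to higher fitness, which is what the selection step expects.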

2.3 Selection Strategies

Selection mechanisms determine which individuals from the population get to become parents. The goal is to favor fitter individuals while still giving less fit individuals a chance to reproduce, maintaining diversity. The degree to which fitter individuals are favored is called selection pressure.

  • Fitness Proportionate Selection (Roulette Wheel):

    • Mechanism: Each individual is assigned a slice of a "roulette wheel" proportional to its fitness. The wheel is spun N times to select N parents.
    • Formula: P(selection_i) = fitness_i / sum(all_fitnesses)
    • Pros: Simple and intuitive.
    • Cons:
      • Can lead to premature convergence if one individual (a "superstar") has a very high fitness and dominates the selection process.
      • Struggles with scaling when fitness values are very close to each other.
      • Cannot be used for minimization problems directly (requires fitness transformation).
  • Tournament Selection:

    • Mechanism: A "tournament" of k individuals is randomly chosen from the population. The individual with the highest fitness within this tournament group is selected as a parent. This process is repeated to select all parents.
    • k (Tournament Size): This parameter controls selection pressure. A larger k means higher pressure (only the very best in larger groups win). A common value for k is 2.
    • Pros:
      • Efficient to implement.
      • Adjustable selection pressure via k.
      • Does not require fitness scaling.
    • Cons: Higher k can reduce population diversity quickly.
  • Rank Selection:

    • Mechanism: Individuals are first sorted by fitness, and selection probability is based on rank rather than raw fitness. In the common linear scheme, the worst individual receives rank 1 and the best receives rank N, with selection probability proportional to rank.
    • Pros:
      • Avoids the "superstar" problem of roulette wheel selection.
      • Prevents premature convergence by capping the selection probability of the best individuals.
    • Cons: Requires sorting the population, which adds computational overhead (O(N log N)).
  • Elitism:

    • Mechanism: Not a standalone selection method, but a modification. The best m individuals from the current generation are guaranteed to be copied, unchanged, to the next generation.
    • Purpose: Ensures that the best-found solution is never lost due to stochastic effects of selection, crossover, or mutation.
    • Benefit: Almost always improves performance and guarantees monotonic improvement in the best fitness across generations.
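The selection strategies above can be sketched as follows; the function names are illustrative, and roulette selection assumes non-negative fitness values:

```python
import random

def roulette_selection(population, fitnesses):
    # Slice of the wheel proportional to fitness (assumes non-negative fitness)
    if sum(fitnesses) <= 0:
        return random.choice(population)
    return random.choices(population, weights=fitnesses, k=1)[0]

def tournament_selection(population, fitnesses, k=2):
    # Pick k at random, keep the fittest; larger k -> higher selection pressure
    contestants = random.sample(range(len(population)), k)
    return population[max(contestants, key=lambda i: fitnesses[i])]

def rank_selection(population, fitnesses):
    # Weight by rank: worst gets weight 1, best gets weight N
    order = sorted(range(len(population)), key=lambda i: fitnesses[i])
    weights = [0] * len(population)
    for rank, i in enumerate(order, start=1):
        weights[i] = rank
    return random.choices(population, weights=weights, k=1)[0]

def apply_elitism(population, fitnesses, offspring, m=1):
    # Copy the best m individuals unchanged into the next generation
    elite = sorted(range(len(population)),
                   key=lambda i: fitnesses[i], reverse=True)[:m]
    return [population[i] for i in elite] + offspring[:len(offspring) - m]
```

Note how tournament selection needs no fitness scaling: only comparisons between contestants matter.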

2.4 Crossover and Mutation Operators

These are the variation operators that create new solutions from existing ones.

Crossover (Recombination):
Combines genetic information from two parents to create one or more offspring. It is the primary mechanism for exploring the search space by recombining building blocks from different parents. Controlled by a crossover rate (e.g., pc = 0.8), the probability that two selected parents will undergo crossover.

  • For Binary/Integer Representation:

    • Single-Point Crossover: A single crossover point is chosen. Material before the point is taken from parent 1, and material after is from parent 2.
      TEXT
              Parent 1: [1 1 0 0 | 1 0 1 1]
              Parent 2: [0 1 1 0 | 0 1 0 1]
              -----------------------------
              Child 1:  [1 1 0 0 | 0 1 0 1]
              Child 2:  [0 1 1 0 | 1 0 1 1]
              
    • Two-Point Crossover: Two points are chosen. Material between the points is swapped.
    • Uniform Crossover: For each gene, a random choice is made whether to take the gene from parent 1 or parent 2.
  • For Real-Valued Representation:

    • Arithmetic Crossover: Creates an offspring gene as a weighted average of the parent genes:
      child_gene = α * parent1_gene + (1-α) * parent2_gene, with α in [0, 1].
    • Blend Crossover (BLX-α): Instead samples each offspring gene uniformly from the interval spanned by the two parent genes, extended by a fraction α on either side.
  • For Permutation Representation:

    • Order Crossover (OX): A subsequence from one parent is copied to the child. The remaining genes are filled in from the other parent in the order they appear.
    • Partially Mapped Crossover (PMX): Similar to OX but preserves the absolute positions of genes more effectively.
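The crossover operators can be sketched in Python. The order crossover below fills the remaining positions from left to right, a common simplification of the classic OX (which fills starting after the copied slice):

```python
import random

def single_point_crossover(p1, p2):
    # Swap tails after a random cut point, producing two children
    point = random.randint(1, len(p1) - 1)
    return p1[:point] + p2[point:], p2[:point] + p1[point:]

def uniform_crossover(p1, p2):
    # Each gene independently copied from one parent or the other
    return [random.choice(pair) for pair in zip(p1, p2)]

def arithmetic_crossover(p1, p2, alpha=0.5):
    # Weighted average of real-valued parent genes
    return [alpha * a + (1 - alpha) * b for a, b in zip(p1, p2)]

def order_crossover(p1, p2):
    # Copy a random slice from p1, fill remaining slots from p2 in order
    n = len(p1)
    a, b = sorted(random.sample(range(n), 2))
    child = [None] * n
    child[a:b] = p1[a:b]
    copied = set(p1[a:b])
    fill = iter(g for g in p2 if g not in copied)
    for i in range(n):
        if child[i] is None:
            child[i] = next(fill)
    return child
```

The key property of order_crossover is that its output is always a valid permutation, which plain single-point crossover cannot guarantee.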

Mutation:
Introduces small, random changes to an individual's chromosome. Its primary role is to maintain genetic diversity in the population and prevent premature convergence by introducing new genetic material, allowing the algorithm to escape local optima. Controlled by a mutation rate (e.g., pm = 0.01), which is typically very low.

  • For Binary Representation:

    • Bit-Flip Mutation: A randomly chosen bit is flipped from 0 to 1 or 1 to 0.
  • For Real-Valued Representation:

    • Gaussian Mutation: A small random value from a Gaussian distribution (mean 0) is added to a selected gene. This allows for fine-tuning solutions.
    • Random Resetting: A gene is replaced with a new random value within its allowed range.
  • For Permutation Representation:

    • Swap Mutation: Two randomly chosen genes are swapped.
    • Scramble Mutation: A random subset of genes is chosen and their positions are scrambled.
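The mutation operators can be sketched similarly; the rates and sigma below are illustrative defaults:

```python
import random

def bit_flip_mutation(chrom, pm=0.01):
    # Flip each bit independently with probability pm
    return [1 - g if random.random() < pm else g for g in chrom]

def gaussian_mutation(chrom, pm=0.1, sigma=0.1):
    # Perturb each gene with zero-mean Gaussian noise with probability pm
    return [g + random.gauss(0, sigma) if random.random() < pm else g
            for g in chrom]

def random_resetting(chrom, lo, hi, pm=0.1):
    # Replace each gene with a fresh random value with probability pm
    return [random.uniform(lo, hi) if random.random() < pm else g
            for g in chrom]

def swap_mutation(perm):
    # Exchange two randomly chosen positions
    i, j = random.sample(range(len(perm)), 2)
    perm = perm[:]
    perm[i], perm[j] = perm[j], perm[i]
    return perm

def scramble_mutation(perm):
    # Shuffle a randomly chosen contiguous segment in place
    a, b = sorted(random.sample(range(len(perm)), 2))
    segment = perm[a:b + 1]
    random.shuffle(segment)
    return perm[:a] + segment + perm[b + 1:]
```

Note that the permutation operators rearrange existing genes rather than altering their values, so the offspring remains a valid permutation.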

3. Convergence Behaviour

Convergence in a GA refers to the state where the population's individuals become very similar or identical, leading to little or no further improvement in fitness over generations. Ideally, the population converges to the region of the global optimum.

A typical convergence plot shows the best and average fitness of the population over generations.

  • Initial phase: Rapid improvement as the GA quickly finds better solutions.
  • Mid-phase: Slower, steady improvement as the GA fine-tunes solutions.
  • Final phase: A plateau, where fitness improves very little or not at all. This indicates convergence.

3.1 Premature Convergence and Diversity Preservation

Premature Convergence is a major problem in GAs. It occurs when the population converges to a local optimum rather than the global optimum. The algorithm gets "stuck" because the population has lost the genetic diversity needed to explore other parts of the search space.

Causes of Premature Convergence:

  1. High Selection Pressure: Overly aggressive selection (e.g., high tournament size) can cause a few good-but-not-optimal individuals to quickly dominate the population, wiping out other potentially useful genetic material.
  2. Poor Population Diversity: If the initial population is not diverse enough, the search starts in a narrow region of the space.
  3. Low Mutation Rate: Insufficient mutation prevents the introduction of new genetic material needed to escape local optima.

Diversity Preservation Techniques:
These methods are used to combat premature convergence by ensuring the population remains diverse.

  • Increasing Mutation Rate: A simple approach is to use a higher mutation rate. Some GAs use an adaptive mutation rate, which increases when the population diversity drops below a threshold.
  • Niching: Methods that encourage the formation of stable subpopulations (niches) around different peaks in the fitness landscape.
    • Fitness Sharing: An individual's raw fitness is reduced based on how many other similar individuals are in the population. This penalizes crowding and rewards individuals that explore unique regions.
  • Crowding: An offspring replaces the most similar individual in the population (from a random subset) instead of a randomly or poorly performing one. This helps preserve different species of solutions.
  • Island Model (Multi-Population GA): The population is divided into several subpopulations (islands) that evolve independently for a number of generations. Periodically, a few individuals migrate between islands. This allows different islands to explore different parts of the search space and share promising solutions.
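Fitness sharing can be sketched as follows, assuming a precomputed pairwise distance matrix and the standard triangular sharing function with niche radius sigma_share:

```python
def shared_fitness(raw_fitness, distances, sigma_share=1.0):
    # distances[i][j]: distance between individuals i and j.
    # Each individual's raw fitness is divided by its "niche count":
    # the more similar neighbours it has, the larger the penalty.
    n = len(raw_fitness)
    shared = []
    for i in range(n):
        niche_count = sum(max(0.0, 1.0 - distances[i][j] / sigma_share)
                          for j in range(n))
        shared.append(raw_fitness[i] / niche_count)
    return shared
```

An individual sitting alone on a distant peak keeps (nearly) its full fitness, while individuals crowded onto the same peak split theirs, which is exactly the pressure that keeps multiple niches alive.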

4. Fitness Landscape Intuition and Search Difficulty

The fitness landscape is a conceptual model used to visualize the search space of an optimization problem. It's an n-dimensional space where:

  • Each point represents a possible solution (chromosome).
  • The "height" of the point is its fitness value.

The goal of the GA is to find the highest peak (global optimum) in this landscape.

Characteristics of Fitness Landscapes and Their Impact on GA Difficulty:

  • Unimodal vs. Multimodal:

    • Unimodal Landscape (e.g., a single large hill): Has only one peak (the global optimum). Easy for most optimization algorithms, including simple hill climbers.
    • Multimodal Landscape (e.g., a mountain range): Has many peaks (local optima) and one highest peak (global optimum). This is where GAs excel, as their population-based approach allows them to explore multiple peaks simultaneously and avoid getting permanently stuck on a smaller one.
  • Ruggedness (Smooth vs. Rugged):

    • Smooth Landscape: Fitness of nearby solutions is highly correlated. Small changes in the solution lead to small changes in fitness. GAs can navigate these landscapes effectively.
    • Rugged Landscape: Fitness of nearby solutions is not correlated. Small changes can lead to large, unpredictable jumps in fitness. These are very difficult to search as the landscape provides little guidance.
  • Deceptive Landscapes:

    • These are the most difficult landscapes for a GA. They contain "deceptive attractors" or local optima that have large basins of attraction and appear promising.
    • The building blocks (schemas) that lead to these local optima are the opposite of those that lead to the global optimum. The GA is actively misled away from the true best solution.
  • Needle-in-a-Haystack:

    • A completely flat landscape with one single, sharp peak (the "needle"). The fitness function provides no information or gradient to guide the search until the exact solution is found. This is equivalent to a random search.

How GAs Navigate the Landscape:

  • Crossover: Allows for large jumps across the landscape, combining features of two different "hills". This is a key mechanism for exploration.
  • Mutation: Allows for small, local steps around a point in the landscape. This is a key mechanism for local exploitation and for escaping a local peak.
  • Selection: Pushes the population up the hills discovered so far.

5. Applications of Genetic Algorithms in Machine Learning

GAs are powerful optimization tools that can be applied to many hard optimization problems within machine learning.

  • Hyperparameter Tuning:

    • Problem: Finding the optimal set of hyperparameters for a machine learning model (e.g., learning rate, number of layers, number of neurons, regularization strength).
    • GA Application:
      • Representation: A chromosome represents a set of hyperparameters (e.g., [0.01, 3, 256, 0.5] for learning rate, number of hidden layers, neurons per layer, dropout rate).
      • Fitness Function: The model is trained with the given hyperparameters, and its performance (e.g., accuracy, F1-score) on a validation set is the fitness. Lower validation error corresponds to higher fitness.
  • Neural Architecture Search (NAS):

    • Problem: Automatically designing the architecture of a neural network (e.g., which layers to use, how to connect them).
    • GA Application:
      • Representation: The chromosome encodes the structure of the network. This can be complex, e.g., a string of bits where each block of bits defines a layer's type, parameters, and connections.
      • Fitness Function: Train the generated network architecture and evaluate its performance on a validation set. This is computationally very expensive.
  • Rule Discovery and Learning Classifier Systems (LCS):

    • Problem: Discovering a set of IF-THEN rules from data to build a classifier.
    • GA Application:
      • Representation: A chromosome can represent a single rule (IF [condition] THEN [class]) or a set of rules.
      • Fitness Function: The fitness of a rule set is its classification accuracy on the training data.

6. Feature Selection using Genetic Algorithms

Feature selection is the process of selecting a subset of relevant features for use in model construction. It helps to reduce overfitting, improve model performance, and reduce training time. This is a combinatorial optimization problem, making it a perfect fit for GAs.

Goal: Find the subset of features that results in the best predictive performance with the fewest number of features.

How to Apply a GA for Feature Selection:

  1. Representation:

    • A binary string is the most common representation. The length of the string is equal to the total number of features available (d).
    • Each bit i in the chromosome corresponds to the i-th feature.
    • 1 means the feature is selected.
    • 0 means the feature is not selected.
    • Example: For 8 features, the chromosome [1, 0, 1, 1, 0, 0, 1, 0] means that features 1, 3, 4, and 7 are selected for training the model.
  2. Fitness Function:

    • The fitness function must balance two competing objectives:
      1. Maximize Model Performance: The primary goal.
      2. Minimize the Number of Features: The secondary goal, to promote simpler models (Occam's razor).
    • A common approach is to use a weighted formula:
      Fitness = w * (Model_Accuracy) - (1 - w) * (Number_of_Selected_Features / Total_Features)
      • Model_Accuracy is the performance of a chosen classifier (e.g., SVM, Decision Tree, Logistic Regression) trained and evaluated (using cross-validation) on only the subset of features indicated by the chromosome.
      • w is a weight parameter (e.g., w = 0.8) that controls the trade-off between accuracy and model simplicity. A higher w prioritizes accuracy.
  3. GA Process:

    • Initialization: Create a random population of binary chromosomes. This means starting with a random collection of feature subsets.
    • Evaluation: For each chromosome in the population:
      a. Identify the feature subset based on the 1s in the string.
      b. Train a machine learning model using only these features.
      c. Evaluate the model on a validation set.
      d. Calculate the fitness score using the formula above.
    • Selection, Crossover, Mutation: Apply standard GA operators.
      • Selection: Fitter chromosomes (those yielding better model performance with fewer features) are more likely to be selected as parents.
      • Crossover: Combines two parent feature subsets to create new ones.
      • Mutation (Bit-flip): Randomly flips a bit, which corresponds to adding or removing a single feature from the subset.
    • Termination: The algorithm runs for a set number of generations. The best chromosome found during the entire run represents the near-optimal feature subset.
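The weighted fitness formula from step 2 can be written directly; accuracy_fn below is a hypothetical stand-in for training and cross-validating a real classifier on the selected feature subset:

```python
def feature_selection_fitness(chromosome, accuracy_fn, w=0.8):
    # chromosome: binary mask over features (1 = selected).
    # accuracy_fn: stand-in for training and scoring a model on the subset.
    n_selected = sum(chromosome)
    if n_selected == 0:
        return 0.0  # an empty subset cannot train a model
    accuracy = accuracy_fn(chromosome)
    # Trade off accuracy against subset size, as in the formula above
    return w * accuracy - (1 - w) * (n_selected / len(chromosome))
```

With equal accuracy, the smaller subset scores higher, which is the Occam's razor pressure the formula encodes.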

Benefits of using GAs for Feature Selection:

  • Can explore a large space of possible feature subsets efficiently.
  • Less prone to getting stuck in local optima compared to greedy methods like forward or backward selection.
  • Can capture complex interactions between features, as it evaluates the fitness of a subset as a whole, not just individual features.