Unit 4 - Notes

INT255

Unit 4: Optimization Techniques for Machine Learning

1. Optimization Problem Formulation in Machine Learning

At its core, training a machine learning model is an optimization problem. We aim to find the set of model parameters (or weights) that minimizes a specific objective function, often called the loss function or cost function.

General Formulation

A standard optimization problem is expressed as:

minimize  f(θ)
subject to θ ∈ S

  • f(θ): The objective function we want to minimize. In ML, this is the loss function that measures the discrepancy between the model's predictions and the true labels.
  • θ: The set of parameters (or weights and biases) of the model. These are the variables we are tuning. For a neural network, θ can be a vector containing millions or even billions of values.
  • S: The feasible set or constraint set. In many ML problems, this is unconstrained (i.e., θ can be any real-valued vector), but constraints can be added, for instance, through regularization.

Empirical Risk Minimization (ERM)

The primary framework for formulating the objective function in supervised learning is Empirical Risk Minimization. The "risk" is the expected loss over the true data distribution, which is unknown. We approximate this by minimizing the "empirical risk," which is the average loss over our training dataset.

Given a training dataset of N samples {(x₁, y₁), (x₂, y₂), ..., (xₙ, yₙ)}:

  • Model Prediction: ŷᵢ = h(xᵢ; θ), where h is our model (e.g., a neural network).
  • Loss Function per sample: L(ŷᵢ, yᵢ) measures the error for a single sample.
  • Objective Function J(θ): The average loss over the entire training set.

J(θ) = (1/N) * Σᵢ₌₁ᴺ L(h(xᵢ; θ), yᵢ)

Common Loss Functions L:

  • Mean Squared Error (MSE) for Regression: L(ŷ, y) = (ŷ - y)²
  • Cross-Entropy Loss for Classification: L(ŷ, y) = -[y * log(ŷ) + (1-y) * log(1-ŷ)] (for binary classification).
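Both losses are easy to sketch in NumPy. The function names and the `eps` clipping guard (added to avoid log(0)) are illustrative choices, not part of the notes:

```python
import numpy as np

def mse_loss(y_pred, y_true):
    """Mean squared error, averaged over samples."""
    return np.mean((y_pred - y_true) ** 2)

def binary_cross_entropy(y_pred, y_true, eps=1e-12):
    """Binary cross-entropy; predictions are clipped to avoid log(0)."""
    y_pred = np.clip(y_pred, eps, 1 - eps)
    return np.mean(-(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred)))
```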

Regularization

To prevent overfitting, where the model learns the training data too well and fails to generalize to new data, a regularization term R(θ) is often added to the objective function. This term penalizes complex models (e.g., models with large parameter values).

The new objective function becomes:

J(θ) = (1/N) * Σᵢ₌₁ⁿ L(h(xᵢ; θ), yᵢ) + λ * R(θ)

  • λ: The regularization hyperparameter, which controls the strength of the penalty.

Common Regularization Terms R(θ):

  • L2 Regularization (Ridge): R(θ) = ||θ||₂² = Σⱼ θⱼ². It penalizes the squared magnitude of the parameters, encouraging smaller, more diffuse weights.
  • L1 Regularization (Lasso): R(θ) = ||θ||₁ = Σⱼ |θⱼ|. It penalizes the absolute value of the parameters, which can lead to sparse models (many parameters become exactly zero).
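As a concrete sketch, the L2-regularized objective for a linear model h(x; θ) = Xθ could be written as follows; the function name and the choice of a linear model are illustrative:

```python
import numpy as np

def regularized_mse(theta, X, y, lam):
    """MSE of a linear model X @ theta plus an L2 penalty lam * ||theta||^2."""
    residual = X @ theta - y
    data_loss = np.mean(residual ** 2)
    return data_loss + lam * np.sum(theta ** 2)
```

Setting `lam=0` recovers the unregularized objective; increasing it trades training fit for smaller weights.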

2. Convex Sets and Convex Functions

Convexity is a fundamental concept in optimization. If a problem is convex, we can usually guarantee that a local minimum is also a global minimum, which makes finding the optimal solution much simpler.

Convex Sets

A set S is convex if for any two points x₁, x₂ ∈ S, the entire line segment connecting them is also contained within S.

  • Formal Definition: For any x₁, x₂ ∈ S and any α with 0 ≤ α ≤ 1, the point αx₁ + (1-α)x₂ is also in S.
  • Examples: A circle, a square, and a plane are all convex. A star shape or a donut (annulus) is non-convex.

Convex Functions

A function f(x) is convex if its domain is a convex set and the line segment connecting any two points on its graph lies on or above the graph.

  • Formal Definition: For any x₁, x₂ in its domain and any α with 0 ≤ α ≤ 1:
    f(αx₁ + (1-α)x₂) ≤ αf(x₁) + (1-α)f(x₂)
  • Strictly Convex: The inequality is strict (<) for x₁ ≠ x₂ and 0 < α < 1, which guarantees a unique global minimum.
  • Second-Order Condition: For a twice-differentiable function, it is convex if and only if its Hessian matrix (the matrix of second partial derivatives) is positive semi-definite for all x.
  • Examples:
    • Convex: f(x) = x², f(x) = e^x.
    • Non-Convex: f(x) = sin(x), f(x) = -x².
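The second-order condition can be checked numerically in one dimension. This finite-difference sketch (the step size and tolerance are illustrative) correctly flags the examples above:

```python
import numpy as np

def second_derivative(f, x, h=1e-4):
    """Central finite-difference estimate of f''(x)."""
    return (f(x + h) - 2 * f(x) + f(x - h)) / h**2

def looks_convex(f, xs):
    """Heuristic second-order check: f''(x) >= 0 at every sample point."""
    return all(second_derivative(f, x) >= -1e-6 for x in xs)
```

For example, `f(x) = x²` passes the check everywhere, while `-x²` fails, and `sin(x)` fails on (0, π) where its second derivative -sin(x) is negative.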

Implications for Machine Learning

  • Convex Problems: Linear Regression with MSE loss, Logistic Regression, and Support Vector Machines are convex optimization problems. For these, gradient descent is guaranteed to find the global minimum.
  • Non-Convex Problems: The loss landscapes of deep neural networks are highly non-convex. They contain many local minima, plateaus, and saddle points. Our goal is not necessarily to find the global minimum (which is computationally intractable) but to find a "good" local minimum that generalizes well.

3. Gradients and Directional Derivatives

Gradients are the primary tool used by optimization algorithms to decide which way to move in the parameter space.

Directional Derivative

The directional derivative D_v f(x) measures the instantaneous rate of change of a function f at a point x as you move in the direction of a unit vector v.

  • Definition: D_v f(x) = lim (h→0⁺) [f(x + hv) - f(x)] / h
  • Relationship with Gradient: It can be computed as the dot product of the gradient and the direction vector:
    D_v f(x) = ∇f(x)ᵀ ⋅ v

Gradient (∇)

The gradient of a multivariate function f(θ) is a vector containing all of its partial derivatives. For θ = (θ₁, θ₂, ..., θₙ):

∇f(θ) = [ ∂f/∂θ₁, ∂f/∂θ₂, ..., ∂f/∂θₙ ]

Key Properties:

  1. Direction of Steepest Ascent: The gradient vector ∇f(θ) at a point θ points in the direction of the greatest rate of increase of the function.
  2. Direction of Steepest Descent: The negative gradient, -∇f(θ), points in the direction of the greatest rate of decrease. This is the core principle behind gradient descent.
  3. Magnitude: The magnitude ||∇f(θ)|| gives the rate of increase along the direction of steepest ascent, i.e., how steep the function is at that point.
  4. Orthogonality: The gradient is always orthogonal (perpendicular) to the level curves (or contour lines) of the function.
  5. Critical Points: At a local minimum, local maximum, or saddle point, the gradient is the zero vector: ∇f(θ) = 0.
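A gradient can be sanity-checked against a central-difference approximation, which perturbs one coordinate at a time. A minimal sketch (the step size h is an illustrative choice):

```python
import numpy as np

def numerical_gradient(f, theta, h=1e-6):
    """Central-difference approximation of each partial derivative of f."""
    grad = np.zeros_like(theta, dtype=float)
    for j in range(theta.size):
        e = np.zeros_like(theta, dtype=float)
        e[j] = h  # perturb only coordinate j
        grad[j] = (f(theta + e) - f(theta - e)) / (2 * h)
    return grad
```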

4. Gradient Descent and Variants

Gradient Descent is an iterative first-order optimization algorithm for finding a local minimum of a differentiable function.

The Core Algorithm

The algorithm starts with an initial guess for the parameters θ and repeatedly updates them by taking a step in the direction of the negative gradient.

Update Rule:
θ_{t+1} = θ_t - η * ∇J(θ_t)

  • θ_t: The parameters at iteration t.
  • η (eta): The learning rate, a small positive scalar that determines the size of each step.
  • ∇J(θ_t): The gradient of the loss function with respect to the parameters at θ_t.

The Learning Rate η is a critical hyperparameter:

  • Too small: The algorithm will converge very slowly.
  • Too large: The algorithm may overshoot the minimum and diverge, with the loss increasing at each step.
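Both regimes can be seen on the toy objective f(θ) = θ², whose gradient is 2θ. A minimal sketch of the update rule (the learning rates below are illustrative):

```python
def gradient_descent(grad, theta0, lr, steps):
    """Plain gradient descent: theta <- theta - lr * grad(theta)."""
    theta = theta0
    for _ in range(steps):
        theta = theta - lr * grad(theta)
    return theta

grad = lambda t: 2 * t  # gradient of f(theta) = theta**2, minimum at 0

near_zero = gradient_descent(grad, 5.0, 0.1, 100)  # small lr: converges toward 0
blown_up = gradient_descent(grad, 5.0, 1.5, 10)    # large lr: |theta| doubles each step
```

With η = 1.5 the update is θ ← (1 - 2η)θ = -2θ, so the iterates alternate sign and grow without bound.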

Variants of Gradient Descent

The variants differ in how much data is used to compute the gradient at each step.

1. Batch Gradient Descent (BGD)

  • How it works: Calculates the gradient of the loss function using the entire training dataset for each parameter update.
  • ∇J(θ) = (1/N) * Σᵢ₌₁ⁿ ∇L(h(xᵢ; θ), yᵢ)
  • Pros:
    • Produces a stable, direct path towards the minimum.
    • Guaranteed to converge to the global minimum for convex problems.
  • Cons:
    • Extremely slow and computationally expensive for large datasets.
    • Requires the entire dataset to be loaded into memory.
    • Not suitable for online learning where data arrives sequentially.

2. Stochastic Gradient Descent (SGD)

  • How it works: Updates the parameters using the gradient calculated from a single, randomly chosen training sample at each step.
  • ∇J(θ) ≈ ∇L(h(xᵢ; θ), yᵢ) for a randomly selected i.
  • Pros:
    • Much faster updates, leading to quicker initial progress.
    • Low memory requirement.
    • The noisy updates can help the algorithm escape from shallow local minima.
  • Cons:
    • High variance in updates leads to a noisy convergence path (the loss function fluctuates significantly).
    • It never truly "converges" but oscillates around a minimum. The learning rate must be slowly decreased to approach the minimum.

3. Mini-Batch Gradient Descent

  • How it works: A compromise between BGD and SGD. It updates the parameters using the gradient calculated on a small, random subset of the training data called a mini-batch.
  • ∇J(θ) ≈ (1/B) * Σ_{i ∈ batch} ∇L(h(xᵢ; θ), yᵢ) where B is the batch size (typically 32, 64, 128, etc.).
  • Pros:
    • Reduces the variance of SGD, leading to more stable convergence.
    • Allows for efficient computation by leveraging vectorized operations on GPUs.
  • Cons:
    • Introduces a new hyperparameter (batch size).
  • This is the most common variant used in modern deep learning.
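Mini-batch gradient descent can be sketched for linear regression with MSE loss; the function name, hyperparameter defaults, and reshuffling scheme are illustrative choices:

```python
import numpy as np

def minibatch_gd(X, y, lr=0.05, batch_size=32, epochs=200, seed=0):
    """Mini-batch gradient descent for linear regression with MSE loss."""
    rng = np.random.default_rng(seed)
    theta = np.zeros(X.shape[1])
    n = len(y)
    for _ in range(epochs):
        idx = rng.permutation(n)  # reshuffle the data each epoch
        for start in range(0, n, batch_size):
            b = idx[start:start + batch_size]
            grad = 2 * X[b].T @ (X[b] @ theta - y[b]) / len(b)
            theta -= lr * grad
    return theta
```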

5. Momentum-based Optimization

Standard gradient descent can struggle in certain scenarios, such as navigating long, narrow valleys (ravines) in the loss landscape, where it tends to oscillate back and forth. Momentum-based methods help accelerate convergence in these situations.

Gradient Descent with Momentum

  • Intuition: Adds "inertia" to the parameter updates. Imagine a heavy ball rolling down a hill. It accumulates momentum, allowing it to roll past small bumps (local minima) and speed up in consistent downhill directions.
  • It maintains a velocity vector v, which is an exponentially decaying moving average of past gradients.

Update Rules:

  1. v_t = β * v_{t-1} + η * ∇J(θ_{t-1})
  2. θ_t = θ_{t-1} - v_t
  • β: The momentum coefficient, typically set to a value like 0.9. It controls how much of the past velocity is carried over.

Nesterov Accelerated Gradient (NAG)

  • Intuition: A "smarter" momentum that "looks ahead." Instead of computing the gradient at the current position, it first takes a step in the direction of the previous velocity, and then computes the gradient from that "look-ahead" position to make a correction.
  • This anticipatory step prevents the algorithm from overshooting the minimum too much.

Update Rules:

  1. θ_lookahead = θ_{t-1} - β * v_{t-1} (Look ahead)
  2. v_t = β * v_{t-1} + η * ∇J(θ_lookahead) (Compute gradient at look-ahead position)
  3. θ_t = θ_{t-1} - v_t (Update using the new velocity)

NAG often converges faster than standard momentum in practice.
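The two update rules above differ only in where the gradient is evaluated, which a side-by-side sketch makes explicit (function names and defaults are illustrative):

```python
def momentum_step(theta, v, grad_fn, lr=0.01, beta=0.9):
    """Classical momentum: accumulate velocity, then step against it."""
    v = beta * v + lr * grad_fn(theta)
    return theta - v, v

def nesterov_step(theta, v, grad_fn, lr=0.01, beta=0.9):
    """Nesterov: evaluate the gradient at the look-ahead point."""
    lookahead = theta - beta * v
    v = beta * v + lr * grad_fn(lookahead)
    return theta - v, v
```

Both take and return the (parameters, velocity) pair, so they can be dropped into the same training loop.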

6. RMSProp and Adam (Adaptive Learning Rate Methods)

A fixed learning rate η may not be optimal. Some parameters may require large updates, while others need very small updates. Adaptive methods automatically adjust the learning rate for each parameter individually during training.

RMSProp (Root Mean Square Propagation)

  • Idea: Maintain a moving average of the squared gradients for each parameter. The learning rate for a parameter is then divided by the square root of this average.
  • Effect: It damps oscillations along dimensions whose gradients are consistently large and speeds up progress along dimensions whose gradients are small.

Update Rules:

  1. S_t = γ * S_{t-1} + (1-γ) * [∇J(θ_{t-1})]² (Element-wise square of gradient)
  2. θ_t = θ_{t-1} - (η / sqrt(S_t + ε)) * ∇J(θ_{t-1})
  • γ: The decay rate, often set to 0.9.
  • ε (epsilon): A small constant (e.g., 10⁻⁸) to prevent division by zero.
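The two update rules translate directly into a per-step function (the name and return convention are illustrative; `grad**2` and the division are element-wise, so the same code works for vector parameters):

```python
import numpy as np

def rmsprop_step(theta, s, grad, lr=0.001, gamma=0.9, eps=1e-8):
    """One RMSProp update: scale the step by a running RMS of gradients."""
    s = gamma * s + (1 - gamma) * grad**2      # moving average of squared grads
    theta = theta - lr * grad / np.sqrt(s + eps)
    return theta, s
```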

Adam (Adaptive Moment Estimation)

  • Idea: The most popular and often default optimization algorithm. It combines the ideas of Momentum (first moment estimation) and RMSProp (second moment estimation).
  • It keeps track of two moving averages for each parameter:
    1. m: The average of the gradients (like momentum).
    2. v: The average of the squared gradients (like RMSProp).

Update Rules:
Let g_t = ∇J(θ_{t-1})

  1. Update biased first moment estimate:
    m_t = β₁ * m_{t-1} + (1-β₁) * g_t
  2. Update biased second moment estimate:
    v_t = β₂ * v_{t-1} + (1-β₂) * g_t²
  3. Compute bias-corrected estimates: (Crucial for the first few iterations when m and v are biased towards zero)
    m̂_t = m_t / (1 - β₁^t)
    v̂_t = v_t / (1 - β₂^t)
  4. Update parameters:
    θ_t = θ_{t-1} - (η / sqrt(v̂_t + ε)) * m̂_t
  • Default Hyperparameters: η=0.001, β₁=0.9, β₂=0.999, ε=10⁻⁸. Adam is generally robust to the choice of these hyperparameters.
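The four Adam steps map one-to-one onto code; this sketch (function name and calling convention are illustrative) keeps the moment estimates as explicit state:

```python
import numpy as np

def adam_step(theta, m, v, grad, t, lr=0.001, b1=0.9, b2=0.999, eps=1e-8):
    """One Adam update; t is the 1-based iteration count."""
    m = b1 * m + (1 - b1) * grad               # first moment (momentum-like)
    v = b2 * v + (1 - b2) * grad**2            # second moment (RMSProp-like)
    m_hat = m / (1 - b1**t)                    # bias correction
    v_hat = v / (1 - b2**t)
    theta = theta - lr * m_hat / (np.sqrt(v_hat + eps))
    return theta, m, v
```

Note that at t = 1 the bias correction rescales m = 0.1·g back to g itself, which is why the very first step has size roughly η rather than 0.1·η.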

7. Optimization Considerations in Large-Scale ML Systems

Training state-of-the-art models on massive datasets introduces unique challenges that require specialized techniques.

Challenges

  • Massive Datasets: Datasets can be terabytes or petabytes in size, too large to fit into the memory of a single machine.
  • Enormous Models: Models like large language models (e.g., GPT-3) have billions of parameters, which may not fit into the memory of a single GPU.
  • Computational Expense: Training can require thousands of GPU-hours, taking days or even weeks.

Techniques and Strategies

1. Distributed Training

Distributing the training workload across multiple processors or machines is essential.

  • Data Parallelism: The most common approach.

    • The model is replicated on each worker (e.g., each GPU).
    • The training data is split, and each worker processes a different mini-batch.
    • Each worker computes its own gradients.
    • The gradients are aggregated (e.g., averaged) across all workers.
    • All models are updated with the aggregated gradient, ensuring they stay in sync.
  • Model Parallelism: Used when the model itself is too large for a single device.

    • The model is partitioned, with different layers or parts of layers placed on different workers.
    • Data is passed between workers to complete the forward and backward passes.
    • This is more complex to implement than data parallelism.
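The data-parallel loop above can be simulated on a single machine: each "worker" is a shard of the data, and the all-reduce is just an average. This sketch assumes a linear model with MSE loss; the function name and shard layout are illustrative:

```python
import numpy as np

def data_parallel_step(theta, X_shards, y_shards, lr=0.1):
    """Simulated data parallelism: each 'worker' computes a gradient on its
    shard, the gradients are averaged (the all-reduce), and every replica
    applies the same update, keeping the copies in sync."""
    grads = [2 * Xw.T @ (Xw @ theta - yw) / len(yw)
             for Xw, yw in zip(X_shards, y_shards)]
    avg_grad = np.mean(grads, axis=0)   # stands in for the all-reduce
    return theta - lr * avg_grad
```

With equal-sized shards, the averaged gradient is exactly the full-batch gradient, so the distributed step matches a single-machine step.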

2. Learning Rate Schedules

Instead of a fixed learning rate η, it's often beneficial to adjust it during training. This is called learning rate scheduling.

  • Step Decay: Reduce η by a certain factor every k epochs.
  • Cosine Annealing: Smoothly vary the learning rate from a high value to a low value following a cosine curve over the course of training.
  • Warm-up: Start with a very small learning rate for the first few epochs and gradually increase it to the target learning rate. This helps stabilize the model at the beginning of training, especially for very large models.
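Warm-up and cosine annealing are often combined into one schedule. A minimal sketch, with illustrative defaults (linear warm-up over 100 steps, annealing to zero):

```python
import math

def lr_schedule(step, total_steps, base_lr=1e-3, warmup_steps=100):
    """Linear warm-up to base_lr, then cosine annealing down to zero."""
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return base_lr * 0.5 * (1 + math.cos(math.pi * progress))
```

The returned value would replace the fixed η in the update rule at each step.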

3. Gradient Accumulation

This technique allows you to simulate a very large batch size when you are limited by GPU memory.

  • How it works:
    1. Process a mini-batch (forward and backward pass) to compute gradients.
    2. Instead of updating the model weights, accumulate the gradients.
    3. Repeat this for a specified number of "accumulation steps".
    4. After processing the desired number of mini-batches, average the accumulated gradients and then perform a single weight update.
  • For example, if your GPU can only handle a batch size of 16, you can use 4 accumulation steps to simulate an effective batch size of 64.

4. Second-Order Optimization Methods (Briefly)

  • Newton's Method: Uses both the gradient (first derivative) and the Hessian matrix (second derivative) to find the minimum.
    • Update rule: θ_{t+1} = θ_t - H⁻¹ * ∇J(θ_t)
  • Pros: Converges much more rapidly (in fewer iterations) than first-order methods.
  • Cons: Computing, storing, and inverting the Hessian matrix (H) is computationally prohibitive for deep learning models. The cost is O(n³) where n is the number of parameters. For a model with millions of parameters, this is impossible.
  • Quasi-Newton Methods (e.g., L-BFGS): These methods approximate the inverse Hessian, making them more feasible. They are popular in classical ML and scientific computing but are still difficult to scale to the large, stochastic settings of deep learning compared to methods like Adam.