Unit 5 - Notes
INT423
Unit 5: Q-Learning & Deep Q-Networks
1. Introduction to Q-Learning
Q-Learning is one of the most fundamental and widely used algorithms in Reinforcement Learning (RL). It is a model-free, off-policy algorithm intended to find the optimal action-selection policy for a given finite Markov Decision Process (MDP).
Key Concepts
- Model-Free: The agent does not need to know the transition probabilities ($P(s' \mid s, a)$) or the reward function of the environment beforehand. It learns strictly through interaction/experience.
- Off-Policy: The algorithm learns the value of the optimal policy independently of the agent's actions. It learns the "best" path even if the agent is currently exploring random paths.
- The Q-Value (Quality):
- $Q(s, a)$ represents the expected cumulative future reward of taking action $a$ in state $s$, and subsequently following the optimal policy.
- Ideally, we want to learn a function that tells the agent exactly how good it is to take a specific action in a specific state.
The Q-Table
In traditional Q-learning, knowledge is stored in a lookup table called a Q-Table.
- Rows: Represent States ($s$).
- Columns: Represent Actions ($a$).
- Cells: Contain the $Q(s, a)$ value.
Initially, the table is initialized to zero or random values. As the agent interacts with the environment, these values are updated to approximate the true optimal values.
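A tiny illustration of a Q-Table stored as a NumPy array (the sizes and values are hypothetical, not taken from the notes):

```python
import numpy as np

n_states, n_actions = 5, 4             # hypothetical environment sizes
Q = np.zeros((n_states, n_actions))    # rows = states, columns = actions, all estimates start at 0

# After some training, one row might look like:
# Q[2] -> array([0.10, 0.90, 0.30, 0.00])   # action 1 currently looks best in state 2
```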
2. The Q-Learning Algorithm
The core of Q-learning is the Bellman Optimality Equation. The algorithm iteratively updates Q-values based on the reward received and the estimated future value.
The Q-Learning Update Rule

$$Q(s, a) \leftarrow Q(s, a) + \alpha \left[ r + \gamma \max_{a'} Q(s', a') - Q(s, a) \right]$$
Parameters Explained
- Learning Rate ($\alpha$): Determines to what extent newly acquired information overrides old information. $0 \le \alpha \le 1$.
- $\alpha = 0$: The agent learns nothing (Q-values never change).
- $\alpha = 1$: The agent considers only the most recent information.
- Discount Factor ($\gamma$): Determines the importance of future rewards. $0 \le \gamma \le 1$.
- $\gamma = 0$: The agent is "myopic" (cares only about immediate reward).
- $\gamma$ close to 1: The agent strives for long-term high reward.
- Temporal Difference (TD) Target: $r + \gamma \max_{a'} Q(s', a')$. This is the "ground truth" estimate derived from the immediate reward and the best possible future prediction.
- TD Error: The difference between the TD Target and the current Q-value, $\big[r + \gamma \max_{a'} Q(s', a')\big] - Q(s, a)$.
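As a quick worked example with hypothetical numbers (not taken from the notes): suppose $\alpha = 0.1$, $\gamma = 0.9$, the current estimate is $Q(s, a) = 0.5$, the reward is $r = 1$, and $\max_{a'} Q(s', a') = 0.8$. The update gives:

$$Q(s, a) \leftarrow 0.5 + 0.1\,\big(1 + 0.9 \times 0.8 - 0.5\big) = 0.5 + 0.1 \times 1.22 = 0.622$$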
Pseudocode
```
Initialize Q(s, a) arbitrarily (usually 0)
Repeat (for each episode):
    Initialize state S
    Repeat (for each step of the episode):
        Choose action A from S using a policy derived from Q (e.g., epsilon-greedy)
        Take action A, observe reward R and next state S'
        # The update step
        Q_max = max(Q(S', a) for all actions a)
        Q(S, A) = Q(S, A) + alpha * (R + gamma * Q_max - Q(S, A))
        S = S'
    Until S is terminal
```
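Below is a minimal runnable version of this pseudocode, assuming the Gymnasium library and its FrozenLake-v1 environment; the hyperparameter values are illustrative, not prescribed by the notes.

```python
# Tabular Q-learning sketch for a Gymnasium environment with discrete
# observation and action spaces (assumed setup, e.g. FrozenLake-v1).
import numpy as np
import gymnasium as gym

env = gym.make("FrozenLake-v1")
alpha, gamma, epsilon = 0.1, 0.99, 1.0                       # illustrative hyperparameters
Q = np.zeros((env.observation_space.n, env.action_space.n))  # the Q-Table, initialized to zero

for episode in range(5000):
    state, _ = env.reset()
    done = False
    while not done:
        # Epsilon-greedy action selection (Section 3)
        if np.random.rand() < epsilon:
            action = env.action_space.sample()               # explore
        else:
            action = int(np.argmax(Q[state]))                # exploit

        next_state, reward, terminated, truncated, _ = env.step(action)
        done = terminated or truncated

        # Q-learning update: TD target uses the best action in S'
        q_max = 0.0 if terminated else np.max(Q[next_state])
        Q[state, action] += alpha * (reward + gamma * q_max - Q[state, action])
        state = next_state

    epsilon = max(0.01, epsilon * 0.995)                     # decay exploration over episodes
```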
3. Epsilon-Greedy Strategy
In RL, the agent faces the Exploration-Exploitation Dilemma:
- Exploitation: Choosing the action currently believed to be the best (highest Q-value) to maximize reward.
- Exploration: Choosing a random action to discover potentially better states or strategies that are currently unknown.
If an agent only exploits, it may get stuck in a local optimum. If it only explores, it never utilizes its knowledge. The $\epsilon$-greedy strategy balances this.
The Strategy
At each time step, the agent selects an action based on a probability $\epsilon$ (epsilon):
- Generate a random number between 0 and 1.
- If the number is less than $\epsilon$ (Exploration):
- Select a random action from the available action space.
- If the number is greater than or equal to $\epsilon$ (Exploitation):
- Select the action with the highest Q-value: $a = \arg\max_{a'} Q(s, a')$.
Epsilon Decay
Usually, we want high exploration at the start of training (when the agent knows nothing) and high exploitation towards the end (when the agent is trained).
- Decay: Start $\epsilon$ at 1.0 (100% random) and multiply it by a decay factor (e.g., 0.995) after every episode until it reaches a minimum value (e.g., 0.01).
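A short sketch of epsilon-greedy selection with multiplicative decay (the function name and the constants are illustrative choices):

```python
import numpy as np

def select_action(q_row: np.ndarray, epsilon: float) -> int:
    """Explore with probability epsilon, otherwise exploit the highest Q-value."""
    if np.random.rand() < epsilon:
        return int(np.random.randint(len(q_row)))  # exploration: uniform random action
    return int(np.argmax(q_row))                   # exploitation: greedy action

epsilon, eps_min, eps_decay = 1.0, 0.01, 0.995
for episode in range(2000):
    # ... run one episode, calling select_action(Q[state], epsilon) at each step ...
    epsilon = max(eps_min, epsilon * eps_decay)    # anneal toward mostly-greedy behavior
```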
4. Deep Q-Networks (DQN)
Traditional Q-Learning using a Q-Table is powerful but limited. It fails when the state space is huge or continuous (e.g., a raw image from a video game can take an astronomically large number of possible values). A table cannot store this.
Deep Q-Learning replaces the Q-Table with a Neural Network (Function Approximator).
Architecture
- Input: The state representation (e.g., 4 stacked image frames of a game).
- Hidden Layers: Convolutional layers (if images) or Dense layers to extract features.
- Output: A vector of Q-values, one for each possible action.
- Example: In a game with actions "Left" and "Right", the network outputs two values, $[Q(s, \text{Left}),\ Q(s, \text{Right})]$ (see the sketch below).
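A sketch of such a network, assuming PyTorch and the common Atari-style input of 4 stacked 84×84 grayscale frames (the framework and layer sizes are assumptions, not prescribed in the notes):

```python
import torch
import torch.nn as nn

class DQN(nn.Module):
    """Maps a stack of frames to one Q-value per action."""
    def __init__(self, n_actions: int):
        super().__init__()
        self.features = nn.Sequential(                        # convolutional feature extractor
            nn.Conv2d(4, 32, kernel_size=8, stride=4), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),
            nn.Conv2d(64, 64, kernel_size=3, stride=1), nn.ReLU(),
            nn.Flatten(),
        )
        self.head = nn.Sequential(                            # dense layers -> Q-values
            nn.Linear(64 * 7 * 7, 512), nn.ReLU(),
            nn.Linear(512, n_actions),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.head(self.features(x / 255.0))            # shape: (batch, n_actions)

q_net = DQN(n_actions=2)                                      # e.g., "Left" and "Right"
q_values = q_net(torch.zeros(1, 4, 84, 84))                   # tensor of shape (1, 2)
```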
Challenges in Deep RL
Naive implementation of Neural Networks in RL is unstable because:
- Correlation: Samples in RL (sequential video frames) are highly correlated. Neural networks assume independent, identically distributed (i.i.d) data.
- Non-stationary Target: In Q-learning, the target ($r + \gamma \max_{a'} Q(s', a')$) changes as the network weights update. The network is chasing a moving target.
DQN introduced two key innovations to solve these:
1. Experience Replay (Replay Buffer)
Instead of updating the network immediately after every step, the agent stores the experience tuple $(s, a, r, s')$ in a finite-sized memory buffer (Replay Buffer).
- Training: During training, random mini-batches are sampled from this buffer.
- Benefit: This breaks the temporal correlation between consecutive samples and stabilizes training.
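A minimal replay buffer sketch (a deque-based illustration; the capacity is an arbitrary choice):

```python
import random
from collections import deque

class ReplayBuffer:
    def __init__(self, capacity: int = 100_000):
        self.buffer = deque(maxlen=capacity)    # oldest experiences are dropped automatically

    def push(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size: int):
        # Uniform random sampling breaks the temporal correlation between consecutive samples
        batch = random.sample(self.buffer, batch_size)
        states, actions, rewards, next_states, dones = zip(*batch)
        return states, actions, rewards, next_states, dones

    def __len__(self):
        return len(self.buffer)
```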
2. Fixed Target Network
DQN uses two neural networks with the same architecture:
- Main Network (Policy Network, weights $\theta$): Used to select actions and is updated at every step.
- Target Network (weights $\theta^-$): Used to calculate the target Q-values ($r + \gamma \max_{a'} Q(s', a'; \theta^-)$).
- Mechanism: The weights of the Main Network are copied to the Target Network only every $C$ steps (e.g., every 10,000 steps).
- Benefit: Keeps the target static for a while, preventing the "chasing your own tail" instability.
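A sketch of the hard-update mechanism, reusing the hypothetical DQN class from the architecture sketch above:

```python
import copy

policy_net = DQN(n_actions=2)                # updated at every training step
target_net = copy.deepcopy(policy_net)       # same architecture, frozen between syncs
target_net.eval()                            # the target network is never trained directly

SYNC_EVERY = 10_000                          # copy interval C (illustrative value)

def maybe_sync(step: int) -> None:
    """Hard update: copy the policy weights into the target network every C steps."""
    if step % SYNC_EVERY == 0:
        target_net.load_state_dict(policy_net.state_dict())
```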
DQN Loss Function
We minimize the Mean Squared Error (MSE) between the Target Q-value and the Predicted Q-value:

$$L(\theta) = \mathbb{E}_{(s, a, r, s') \sim \mathcal{D}} \left[ \left( r + \gamma \max_{a'} Q(s', a'; \theta^-) - Q(s, a; \theta) \right)^2 \right]$$
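A sketch of this loss for one mini-batch in PyTorch (tensor names and the batch layout are assumptions, following the earlier sketches):

```python
import torch
import torch.nn.functional as F

def dqn_loss(policy_net, target_net, batch, gamma: float = 0.99) -> torch.Tensor:
    states, actions, rewards, next_states, dones = batch     # tensors sampled from the replay buffer
    # Prediction: Q(s, a; theta) for the actions that were actually taken
    q_pred = policy_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)
    # Target: r + gamma * max_a' Q(s', a'; theta_minus), with no gradient through the target net
    with torch.no_grad():
        q_next = target_net(next_states).max(dim=1).values
        q_target = rewards + gamma * q_next * (1 - dones)
    return F.mse_loss(q_pred, q_target)
```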
5. Double DQN (DDQN)
The Problem with DQN: Overestimation
Standard DQN tends to overestimate Q-values.
- Reason: The maximization operator ($\max_{a'}$) inside the target calculation uses the same network to both select and evaluate an action.
- If the network makes a mistake and assigns a high value to a suboptimal action, the $\max$ operator picks it, propagating the error.
The Solution
Double DQN decouples the action selection from the action evaluation.
- Selection: Use the Main Network ($\theta$) to decide which action is best in the next state.
- Evaluation: Use the Target Network ($\theta^-$) to calculate the Q-value of that selected action.
The DDQN Update Equation

$$y^{\text{DDQN}} = r + \gamma \, Q\!\left(s',\ \arg\max_{a'} Q(s', a'; \theta);\ \theta^-\right)$$
This simple change significantly reduces overestimation bias and leads to more stable learning.
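A sketch of the Double DQN target in PyTorch, directly mirroring the selection/evaluation split above (names follow the earlier sketches):

```python
import torch

def ddqn_target(policy_net, target_net, rewards, next_states, dones, gamma: float = 0.99):
    with torch.no_grad():
        # Selection: the MAIN network chooses the best next action
        best_actions = policy_net(next_states).argmax(dim=1, keepdim=True)
        # Evaluation: the TARGET network scores that chosen action
        q_eval = target_net(next_states).gather(1, best_actions).squeeze(1)
        return rewards + gamma * q_eval * (1 - dones)
```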
6. Dueling DQN
Dueling DQN represents an innovation in Neural Network Architecture, rather than the update algorithm itself.
The Intuition
In many states, the value of the state matters more than the specific action taken.
- Example: In a driving game, if the car is about to crash into a wall, the value of the state is very low (imminent death). It doesn't matter much whether you steer left or right; the state value dominates.
Architecture Split
Standard DQN outputs Q-values directly. Dueling DQN splits the network into two separate streams after the convolutional layers:
- Value Stream $V(s)$: Estimates the value of being in state $s$ (a scalar).
- Advantage Stream $A(s, a)$: Estimates how much better taking action $a$ is compared to the average action in state $s$ (a vector of size $|\mathcal{A}|$, one entry per action).
Aggregation Layer
The two streams are combined at the end to produce the final Q-values:

$$Q(s, a) = V(s) + A(s, a)$$

However, to ensure the equation is identifiable (uniquely solvable), we force the mean of the advantages to be zero. The actual implementation uses:

$$Q(s, a) = V(s) + \left( A(s, a) - \frac{1}{|\mathcal{A}|} \sum_{a'} A(s, a') \right)$$
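A sketch of the dueling head and its aggregation in PyTorch (the feature dimension and layer sizes are illustrative; the trunk could be the convolutional extractor from the DQN sketch above):

```python
import torch
import torch.nn as nn

class DuelingHead(nn.Module):
    """Splits shared features into V(s) and A(s, a), then recombines them."""
    def __init__(self, feature_dim: int, n_actions: int):
        super().__init__()
        self.value = nn.Sequential(nn.Linear(feature_dim, 256), nn.ReLU(),
                                   nn.Linear(256, 1))              # V(s): one scalar per sample
        self.advantage = nn.Sequential(nn.Linear(feature_dim, 256), nn.ReLU(),
                                       nn.Linear(256, n_actions))  # A(s, a): one value per action

    def forward(self, features: torch.Tensor) -> torch.Tensor:
        v = self.value(features)                  # shape (batch, 1)
        a = self.advantage(features)              # shape (batch, n_actions)
        # Q = V + (A - mean(A)) makes the decomposition identifiable
        return v + a - a.mean(dim=1, keepdim=True)
```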
Benefits
- Faster Convergence: The agent learns the state value independently of the action. This is efficient because $V(s)$ is updated every time state $s$ is visited, regardless of the action taken.
- Better Policy: It identifies correct actions more quickly in states where action choice has little effect on the outcome.
Summary Comparison Table
| Algorithm | Model Type | Core Mechanism | Solves |
|---|---|---|---|
| Q-Learning | Tabular | Bellman Equation Update | Basic RL in small discrete spaces. |
| DQN | Deep Neural Net | Experience Replay + Target Net | High-dimensional state spaces (e.g., images). |
| Double DQN | Deep Neural Net | Decoupled Selection/Evaluation | Overestimation bias of Q-values. |
| Dueling DQN | Deep Neural Net | Split Architecture (Value/Advantage) | Faster learning by isolating state value. |