Unit 5 - Notes
INT423
Unit 5: Q-Learning & Deep Q-Networks
1. Introduction to Q-Learning
Q-Learning is one of the most fundamental and widely used algorithms in Reinforcement Learning (RL). It is a model-free, off-policy algorithm intended to find the optimal action-selection policy for a given finite Markov Decision Process (MDP).
Key Concepts
- Model-Free: The agent does not need to know the transition probabilities ($P(s' \mid s, a)$) or the reward function of the environment beforehand. It learns strictly through interaction/experience.
- Off-Policy: The algorithm learns the value of the optimal policy independently of the agent's actions. It learns the "best" path even if the agent is currently exploring random paths.
- The Q-Value (Quality):
- $Q(s, a)$ represents the expected cumulative future reward of taking action $a$ in state $s$, and subsequently following the optimal policy.
- Ideally, we want to learn a function that tells the agent exactly how good it is to take a specific action in a specific state.
The Q-Table
In traditional Q-learning, knowledge is stored in a lookup table called a Q-Table.
- Rows: Represent States ($s$).
- Columns: Represent Actions ($a$).
- Cells: Contain the $Q(s, a)$ value.
Initially, the table is initialized to zero or random values. As the agent interacts with the environment, these values are updated to approximate the true optimal values.
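A tiny illustration of a Q-Table stored as a NumPy array (the sizes and values are hypothetical, not taken from the notes):

```python
import numpy as np

n_states, n_actions = 5, 4             # hypothetical environment sizes
Q = np.zeros((n_states, n_actions))    # rows = states, columns = actions, all estimates start at 0

# After some training, one row might look like:
# Q[2] -> array([0.10, 0.90, 0.30, 0.00])   # action 1 currently looks best in state 2
```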
2. The Q-Learning Algorithm
The core of Q-learning is the Bellman Optimality Equation. The algorithm iteratively updates Q-values based on the reward received and the estimated future value.
The Q-Learning Update Rule

$$Q(s, a) \leftarrow Q(s, a) + \alpha \left[ r + \gamma \max_{a'} Q(s', a') - Q(s, a) \right]$$
Parameters Explained
- Learning Rate ($\alpha$): Determines to what extent newly acquired information overrides old information. $0 \le \alpha \le 1$.
- $\alpha = 0$: The agent learns nothing (Q-values never change).
- $\alpha = 1$: The agent considers only the most recent information.
- Discount Factor ($\gamma$): Determines the importance of future rewards. $0 \le \gamma \le 1$.
- $\gamma = 0$: The agent is "myopic" (cares only about immediate reward).
- $\gamma$ close to 1: The agent strives for long-term high reward.
- Temporal Difference (TD) Target: $r + \gamma \max_{a'} Q(s', a')$. This is the "ground truth" estimate derived from the immediate reward and the best possible future prediction.
- TD Error: The difference between the TD Target and the current Q-value, $\big[r + \gamma \max_{a'} Q(s', a')\big] - Q(s, a)$.
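As a quick worked example with hypothetical numbers (not taken from the notes): suppose $\alpha = 0.1$, $\gamma = 0.9$, the current estimate is $Q(s, a) = 0.5$, the reward is $r = 1$, and $\max_{a'} Q(s', a') = 0.8$. The update gives:

$$Q(s, a) \leftarrow 0.5 + 0.1\,\big(1 + 0.9 \times 0.8 - 0.5\big) = 0.5 + 0.1 \times 1.22 = 0.622$$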
Pseudocode
```
Initialize Q(s, a) arbitrarily (usually 0)
Repeat (for each episode):
    Initialize state S
    Repeat (for each step of the episode):
        Choose action A from S using a policy derived from Q (e.g., epsilon-greedy)
        Take action A, observe reward R and next state S'
        # The update step
        Q_max = max(Q(S', a) for all actions a)
        Q(S, A) = Q(S, A) + alpha * (R + gamma * Q_max - Q(S, A))
        S = S'
    Until S is terminal
```
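Below is a minimal runnable version of this pseudocode, assuming the Gymnasium library and its FrozenLake-v1 environment; the hyperparameter values are illustrative, not prescribed by the notes.

```python
# Tabular Q-learning sketch for a Gymnasium environment with discrete
# observation and action spaces (assumed setup, e.g. FrozenLake-v1).
import numpy as np
import gymnasium as gym

env = gym.make("FrozenLake-v1")
alpha, gamma, epsilon = 0.1, 0.99, 1.0                       # illustrative hyperparameters
Q = np.zeros((env.observation_space.n, env.action_space.n))  # the Q-Table, initialized to zero

for episode in range(5000):
    state, _ = env.reset()
    done = False
    while not done:
        # Epsilon-greedy action selection (Section 3)
        if np.random.rand() < epsilon:
            action = env.action_space.sample()               # explore
        else:
            action = int(np.argmax(Q[state]))                # exploit

        next_state, reward, terminated, truncated, _ = env.step(action)
        done = terminated or truncated

        # Q-learning update: TD target uses the best action in S'
        q_max = 0.0 if terminated else np.max(Q[next_state])
        Q[state, action] += alpha * (reward + gamma * q_max - Q[state, action])
        state = next_state

    epsilon = max(0.01, epsilon * 0.995)                     # decay exploration over episodes
```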
3. Epsilon-Greedy Strategy
In RL, the agent faces the Exploration-Exploitation Dilemma:
- Exploitation: Choosing the action currently believed to be the best (highest Q-value) to maximize reward.
- Exploration: Choosing a random action to discover potentially better states or strategies that are currently unknown.
If an agent only exploits, it may get stuck in a local optimum. If it only explores, it never utilizes its knowledge. The $\epsilon$-greedy strategy balances this.
The Strategy
At each time step, the agent selects an action based on a probability $\epsilon$ (epsilon):
- Generate a random number between 0 and 1.
- If the number is less than $\epsilon$ (Exploration):
- Select a random action from the available action space.
- If the number is greater than or equal to $\epsilon$ (Exploitation):
- Select the action with the highest Q-value: $a = \arg\max_{a'} Q(s, a')$.
Epsilon Decay
Usually, we want high exploration at the start of training (when the agent knows nothing) and high exploitation towards the end (when the agent is trained).
- Decay: Start $\epsilon$ at 1.0 (100% random) and multiply it by a decay factor (e.g., 0.995) after every episode until it reaches a minimum value (e.g., 0.01).
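A short sketch of epsilon-greedy selection with multiplicative decay (the function name and the constants are illustrative choices):

```python
import numpy as np

def select_action(q_row: np.ndarray, epsilon: float) -> int:
    """Explore with probability epsilon, otherwise exploit the highest Q-value."""
    if np.random.rand() < epsilon:
        return int(np.random.randint(len(q_row)))  # exploration: uniform random action
    return int(np.argmax(q_row))                   # exploitation: greedy action

epsilon, eps_min, eps_decay = 1.0, 0.01, 0.995
for episode in range(2000):
    # ... run one episode, calling select_action(Q[state], epsilon) at each step ...
    epsilon = max(eps_min, epsilon * eps_decay)    # anneal toward mostly-greedy behavior
```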
4. Deep Q-Networks (DQN)
Traditional Q-Learning using a Q-Table is powerful but limited. It fails when the state space is huge or continuous (e.g., a raw image from a video game can take an astronomically large number of possible values). A table cannot store this.
Deep Q-Learning replaces the Q-Table with a Neural Network (Function Approximator).
Architecture
- Input: The state representation (e.g., 4 stacked image frames of a game).
- Hidden Layers: Convolutional layers (if images) or Dense layers to extract features.
- Output: A vector of Q-values, one for each possible action.
- Example: In a game with actions "Left" and "Right", the network outputs two values, $[Q(s, \text{Left}),\ Q(s, \text{Right})]$ (see the sketch below).
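A sketch of such a network, assuming PyTorch and the common Atari-style input of 4 stacked 84×84 grayscale frames (the framework and layer sizes are assumptions, not prescribed in the notes):

```python
import torch
import torch.nn as nn

class DQN(nn.Module):
    """Maps a stack of frames to one Q-value per action."""
    def __init__(self, n_actions: int):
        super().__init__()
        self.features = nn.Sequential(                        # convolutional feature extractor
            nn.Conv2d(4, 32, kernel_size=8, stride=4), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),
            nn.Conv2d(64, 64, kernel_size=3, stride=1), nn.ReLU(),
            nn.Flatten(),
        )
        self.head = nn.Sequential(                            # dense layers -> Q-values
            nn.Linear(64 * 7 * 7, 512), nn.ReLU(),
            nn.Linear(512, n_actions),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.head(self.features(x / 255.0))            # shape: (batch, n_actions)

q_net = DQN(n_actions=2)                                      # e.g., "Left" and "Right"
q_values = q_net(torch.zeros(1, 4, 84, 84))                   # tensor of shape (1, 2)
```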
Challenges in Deep RL
Naive implementation of Neural Networks in RL is unstable because:
- Correlation: Samples in RL (sequential video frames) are highly correlated. Neural networks assume independent, identically distributed (i.i.d) data.
- Non-stationary Target: In Q-learning, the target ($r + \gamma \max_{a'} Q(s', a')$) changes as the network weights update. The network is chasing a moving target.
DQN introduced two key innovations to solve these:
1. Experience Replay (Replay Buffer)
Instead of updating the network immediately after every step, the agent stores the experience tuple $(s, a, r, s')$ in a finite-sized memory buffer (Replay Buffer).
- Training: During training, random mini-batches are sampled from this buffer.
- Benefit: This breaks the temporal correlation between consecutive samples and stabilizes training.
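A minimal replay buffer sketch (a deque-based illustration; the capacity is an arbitrary choice):

```python
import random
from collections import deque

class ReplayBuffer:
    def __init__(self, capacity: int = 100_000):
        self.buffer = deque(maxlen=capacity)    # oldest experiences are dropped automatically

    def push(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size: int):
        # Uniform random sampling breaks the temporal correlation between consecutive samples
        batch = random.sample(self.buffer, batch_size)
        states, actions, rewards, next_states, dones = zip(*batch)
        return states, actions, rewards, next_states, dones

    def __len__(self):
        return len(self.buffer)
```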
2. Fixed Target Network
DQN uses two neural networks with the same architecture:
- Main Network (Policy Network, weights $\theta$): Used to select actions and is updated at every step.
- Target Network (weights $\theta^-$): Used to calculate the target Q-values ($r + \gamma \max_{a'} Q(s', a'; \theta^-)$).
- Mechanism: The weights of the Main Network are copied to the Target Network only every $C$ steps (e.g., every 10,000 steps).
- Benefit: Keeps the target static for a while, preventing the "chasing your own tail" instability.
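A sketch of the hard-update mechanism, reusing the hypothetical DQN class from the architecture sketch above:

```python
import copy

policy_net = DQN(n_actions=2)                # updated at every training step
target_net = copy.deepcopy(policy_net)       # same architecture, frozen between syncs
target_net.eval()                            # the target network is never trained directly

SYNC_EVERY = 10_000                          # copy interval C (illustrative value)

def maybe_sync(step: int) -> None:
    """Hard update: copy the policy weights into the target network every C steps."""
    if step % SYNC_EVERY == 0:
        target_net.load_state_dict(policy_net.state_dict())
```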
DQN Loss Function
We minimize the Mean Squared Error (MSE) between the Target Q-value and the Predicted Q-value:

$$L(\theta) = \mathbb{E}_{(s, a, r, s') \sim \mathcal{D}} \left[ \left( r + \gamma \max_{a'} Q(s', a'; \theta^-) - Q(s, a; \theta) \right)^2 \right]$$
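A sketch of this loss for one mini-batch in PyTorch (tensor names and the batch layout are assumptions, following the earlier sketches):

```python
import torch
import torch.nn.functional as F

def dqn_loss(policy_net, target_net, batch, gamma: float = 0.99) -> torch.Tensor:
    states, actions, rewards, next_states, dones = batch     # tensors sampled from the replay buffer
    # Prediction: Q(s, a; theta) for the actions that were actually taken
    q_pred = policy_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)
    # Target: r + gamma * max_a' Q(s', a'; theta_minus), with no gradient through the target net
    with torch.no_grad():
        q_next = target_net(next_states).max(dim=1).values
        q_target = rewards + gamma * q_next * (1 - dones)
    return F.mse_loss(q_pred, q_target)
```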
5. Double DQN (DDQN)
The Problem with DQN: Overestimation
Standard DQN tends to overestimate Q-values.
- Reason: The maximization operator ($\max_{a'}$) inside the target calculation uses the same network to both select and evaluate an action.
- If the network makes a mistake and assigns a high value to a suboptimal action, the $\max$ operator picks it, propagating the error.
The Solution
Double DQN decouples the action selection from the action evaluation.
- Selection: Use the Main Network ($\theta$) to decide which action is best in the next state.
- Evaluation: Use the Target Network ($\theta^-$) to calculate the Q-value of that selected action.
The DDQN Update Equation

$$y^{\text{DDQN}} = r + \gamma \, Q\!\left(s',\ \arg\max_{a'} Q(s', a'; \theta);\ \theta^-\right)$$
This simple change significantly reduces overestimation bias and leads to more stable learning.
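A sketch of the Double DQN target in PyTorch, directly mirroring the selection/evaluation split above (names follow the earlier sketches):

```python
import torch

def ddqn_target(policy_net, target_net, rewards, next_states, dones, gamma: float = 0.99):
    with torch.no_grad():
        # Selection: the MAIN network chooses the best next action
        best_actions = policy_net(next_states).argmax(dim=1, keepdim=True)
        # Evaluation: the TARGET network scores that chosen action
        q_eval = target_net(next_states).gather(1, best_actions).squeeze(1)
        return rewards + gamma * q_eval * (1 - dones)
```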
6. Dueling DQN
Dueling DQN represents an innovation in Neural Network Architecture, rather than the update algorithm itself.
The Intuition
In many states, the value of the state matters more than the specific action taken.
- Example: In a driving game, if the car is about to crash into a wall, the value of the state is very low (imminent death). It doesn't matter much whether you steer left or right; the state value dominates.
Architecture Split
Standard DQN outputs Q-values directly. Dueling DQN splits the network into two separate streams after the convolutional layers:
- Value Stream $V(s)$: Estimates the value of being in state $s$ (a scalar).
- Advantage Stream $A(s, a)$: Estimates how much better taking action $a$ is compared to the average action in state $s$ (a vector of size $|\mathcal{A}|$, one entry per action).
Aggregation Layer
The two streams are combined at the end to produce the final Q-values:

$$Q(s, a) = V(s) + A(s, a)$$

However, to ensure the equation is identifiable (uniquely solvable), we force the mean of the advantages to be zero. The actual implementation uses:

$$Q(s, a) = V(s) + \left( A(s, a) - \frac{1}{|\mathcal{A}|} \sum_{a'} A(s, a') \right)$$
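A sketch of the dueling head and its aggregation in PyTorch (the feature dimension and layer sizes are illustrative; the trunk could be the convolutional extractor from the DQN sketch above):

```python
import torch
import torch.nn as nn

class DuelingHead(nn.Module):
    """Splits shared features into V(s) and A(s, a), then recombines them."""
    def __init__(self, feature_dim: int, n_actions: int):
        super().__init__()
        self.value = nn.Sequential(nn.Linear(feature_dim, 256), nn.ReLU(),
                                   nn.Linear(256, 1))              # V(s): one scalar per sample
        self.advantage = nn.Sequential(nn.Linear(feature_dim, 256), nn.ReLU(),
                                       nn.Linear(256, n_actions))  # A(s, a): one value per action

    def forward(self, features: torch.Tensor) -> torch.Tensor:
        v = self.value(features)                  # shape (batch, 1)
        a = self.advantage(features)              # shape (batch, n_actions)
        # Q = V + (A - mean(A)) makes the decomposition identifiable
        return v + a - a.mean(dim=1, keepdim=True)
```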
Benefits
- Faster Convergence: The agent learns the state value independently of the action. This is efficient because $V(s)$ is updated every time state $s$ is visited, regardless of the action taken.
- Better Policy: It identifies correct actions more quickly in states where action choice has little effect on the outcome.
Summary Comparison Table
| Algorithm | Model Type | Core Mechanism | Solves |
|---|---|---|---|
| Q-Learning | Tabular | Bellman Equation Update | Basic RL in small discrete spaces. |
| DQN | Deep Neural Net | Experience Replay + Target Net | High-dimensional state spaces (e.g., images). |
| Double DQN | Deep Neural Net | Decoupled Selection/Evaluation | Overestimation bias of Q-values. |
| Dueling DQN | Deep Neural Net | Split Architecture (Value/Advantage) | Faster learning by isolating state value. |