Unit 4 - Notes
INT423
Unit 4: Foundations of Reinforcement Learning
1. What is Reinforcement Learning (RL)?
Reinforcement Learning is a subfield of Machine Learning in which an agent learns how to behave in an environment by performing actions and observing the results. Unlike supervised learning, where the training data comes with correct answers (labels), in RL the learner is not told which actions to take but must discover which actions yield the most reward by trying them.
Core Concepts and Terminology
- Agent: The entity that learns and makes decisions.
- Environment: The world through which the agent moves and responds.
- State ($s_t$): A representation of the environment's situation at a specific time step $t$.
- Action ($a_t$): The move the agent decides to make at time step $t$.
- Reward ($r_{t+1}$): A scalar feedback signal from the environment indicating how well the agent is doing at step $t$.
- Episode: A sequence of states, actions, and rewards from an initial state to a terminal state (e.g., a single game of chess).
The RL Loop
- The Agent observes state $s_t$.
- The Agent selects action $a_t$.
- The Environment receives action $a_t$ and transitions to a new state $s_{t+1}$.
- The Environment provides a reward $r_{t+1}$ to the Agent.
- The cycle repeats.
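A minimal sketch of this loop in Python, assuming a hypothetical Gym-style environment with `reset()` and `step()` methods and a placeholder agent that picks actions at random:

```python
import random

class RandomAgent:
    """Placeholder agent: a real agent would use a learned policy here."""
    def __init__(self, actions):
        self.actions = actions

    def select_action(self, state):
        return random.choice(self.actions)

def run_episode(env, agent):
    state = env.reset()                              # observe initial state s_0
    total_reward, done = 0.0, False
    while not done:
        action = agent.select_action(state)          # agent selects a_t
        next_state, reward, done = env.step(action)  # environment transitions and rewards
        total_reward += reward
        state = next_state                           # the cycle repeats
    return total_reward
```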
The Reward Hypothesis
All goals can be described by the maximization of the expected cumulative reward.
2. Markov Decision Process (MDP)
An MDP is the mathematical framework used to describe an environment in reinforcement learning. An environment is considered an MDP if it satisfies the Markov Property.
The Markov Property
A state $s_t$ is Markov if and only if:

$$P[s_{t+1} \mid s_t] = P[s_{t+1} \mid s_1, s_2, \dots, s_t]$$
This means the future depends only on the current state, not on the history of events that occurred before the current state. The current state captures all relevant information from the past.
Components of an MDP
An MDP is defined by a tuple $(S, A, P, R, \gamma)$:
- $S$ (State Space): A set of all valid states.
- $A$ (Action Space): A set of all valid actions.
- $P$ (Transition Probability): The dynamics of the system. $P(s' \mid s, a)$ is the probability of moving to state $s'$ given current state $s$ and action $a$.
- $R$ (Reward Function): The expected immediate reward, $R(s, a) = \mathbb{E}[r_{t+1} \mid s_t = s, a_t = a]$.
- $\gamma$ (Discount Factor): A number between $0$ and $1$.
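These components can be written down concretely. Below is a minimal sketch of a made-up two-state MDP encoded as plain Python dictionaries; the states, actions, and reward values are illustrative assumptions, not part of the notes.

```python
# A tiny illustrative MDP: two states, two actions.
states = ["A", "B"]
actions = ["stay", "move"]
gamma = 0.9   # discount factor

# P[(s, a)] -> {s': probability of transitioning to s'}
P = {
    ("A", "stay"): {"A": 1.0},
    ("A", "move"): {"B": 1.0},
    ("B", "stay"): {"B": 1.0},
    ("B", "move"): {"A": 1.0},
}

# R[(s, a)] -> expected immediate reward
R = {
    ("A", "stay"): 0.0,
    ("A", "move"): 1.0,
    ("B", "stay"): 0.0,
    ("B", "move"): 0.0,
}
```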
The Discount Factor ($\gamma$)
The discount factor $\gamma$ determines the present value of future rewards.
- $\gamma = 0$: The agent is "myopic" (cares only about immediate rewards).
- $\gamma$ close to $1$: The agent is "far-sighted" (cares about long-term returns).
Why discount?
- Mathematical convergence (prevents infinite returns in cyclic processes).
- Uncertainty about the future.
- Immediate rewards are generally preferred (financial interest).
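As a quick numeric illustration (assuming a reward of $1$ at every step, an assumption made only for this example), the discounted sum with $\gamma = 0.9$ is

$$1 + 0.9 + 0.9^2 + 0.9^3 + \dots = \frac{1}{1 - 0.9} = 10$$

With $\gamma = 0.5$ the same reward stream is worth only $2$, and with $\gamma = 0$ only the immediate reward of $1$ counts.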
3. Policies and Value Functions
Policy ($\pi$)
A policy is the agent's behavior function: a map from state to action. It defines how the agent behaves at a given time.
- Deterministic Policy: $a = \pi(s)$ maps a state directly to an action.
- Stochastic Policy: $\pi(a \mid s) = P[a_t = a \mid s_t = s]$ outputs a probability distribution over actions.
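A small sketch contrasting the two policy types (the states `s1`, `s2` and actions `left`, `right` are hypothetical):

```python
import random

# Deterministic policy: a direct state -> action mapping.
deterministic_policy = {"s1": "left", "s2": "right"}

# Stochastic policy: state -> probability distribution over actions.
stochastic_policy = {
    "s1": {"left": 0.8, "right": 0.2},
    "s2": {"left": 0.1, "right": 0.9},
}

def sample_action(policy, state):
    """Draw an action from a stochastic policy's distribution for this state."""
    dist = policy[state]
    return random.choices(list(dist), weights=list(dist.values()), k=1)[0]
```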
Return ($G_t$)
The goal of RL is to maximize the cumulative discounted return:

$$G_t = r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + \dots = \sum_{k=0}^{\infty} \gamma^k r_{t+k+1}$$
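A short helper that computes this return for a finished list of rewards (a sketch; the reward list and $\gamma$ value below are illustrative):

```python
def discounted_return(rewards, gamma):
    """Compute G_t = r_{t+1} + gamma * r_{t+2} + ... by working backwards."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g

# Example: three rewards of 1 with gamma = 0.9 -> 1 + 0.9 + 0.81 = 2.71
print(discounted_return([1, 1, 1], gamma=0.9))
```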
Value Functions
Value functions measure "how good" it is to be in a specific state (or state-action pair).
1. State-Value Function $V^\pi(s)$
The expected return starting from state $s$ and following policy $\pi$ thereafter:

$$V^\pi(s) = \mathbb{E}_\pi\left[ G_t \mid s_t = s \right]$$

2. Action-Value Function $Q^\pi(s, a)$
The expected return starting from state $s$, taking action $a$, and following policy $\pi$ thereafter. This is crucial for control (deciding which action is best):

$$Q^\pi(s, a) = \mathbb{E}_\pi\left[ G_t \mid s_t = s, a_t = a \right]$$
4. Bellman Equation
The Bellman Equation decomposes the value function into two parts:
- The immediate reward ($r_{t+1}$).
- The discounted value of the successor state ($\gamma V(s_{t+1})$).
This recursive definition allows us to solve for optimal policies using dynamic programming.
Bellman Expectation Equation (for $V^\pi$)

$$V^\pi(s) = \sum_{a} \pi(a \mid s) \left[ R(s, a) + \gamma \sum_{s'} P(s' \mid s, a) V^\pi(s') \right]$$

Interpretation: The value of state $s$ is the average of the immediate reward plus the discounted values of the next possible states $s'$, weighted by how likely we are to go there under the policy and the dynamics.
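A sketch of iterative policy evaluation, which applies the Bellman expectation equation repeatedly until $V^\pi$ stops changing; it assumes the dictionary-style `states`, `P`, `R`, and stochastic `policy` formats used in the earlier sketches:

```python
def policy_evaluation(states, P, R, policy, gamma, tol=1e-8):
    """Repeatedly apply the Bellman expectation equation until V stops changing."""
    V = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            # V(s) = sum_a pi(a|s) * [ R(s,a) + gamma * sum_s' P(s'|s,a) * V(s') ]
            new_v = sum(
                p_a * (R[(s, a)] + gamma * sum(p * V[s2] for s2, p in P[(s, a)].items()))
                for a, p_a in policy[s].items()
            )
            delta = max(delta, abs(new_v - V[s]))
            V[s] = new_v
        if delta < tol:
            return V
```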
Bellman Optimality Equation
To find the best possible behavior, we look for the Optimal Value Function ($V^*$ and $Q^*$) and Optimal Policy ($\pi^*$):

$$V^*(s) = \max_a \left[ R(s, a) + \gamma \sum_{s'} P(s' \mid s, a) V^*(s') \right]$$

$$Q^*(s, a) = R(s, a) + \gamma \sum_{s'} P(s' \mid s, a) \max_{a'} Q^*(s', a')$$

If an agent knows $Q^*$, the optimal policy is simply to take the action with the highest $Q$-value:

$$\pi^*(s) = \arg\max_{a} Q^*(s, a)$$
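A sketch of value iteration, which applies the Bellman optimality equation until $V^*$ converges and then extracts the greedy policy; the dictionary-based MDP format is again an assumption carried over from the earlier sketch:

```python
def value_iteration(states, actions, P, R, gamma, tol=1e-8):
    """Apply the Bellman optimality equation until V* converges, then extract pi*."""
    def backup(s, a, V):
        # One-step lookahead: R(s,a) + gamma * sum_s' P(s'|s,a) * V(s')
        return R[(s, a)] + gamma * sum(p * V[s2] for s2, p in P[(s, a)].items())

    V = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            new_v = max(backup(s, a, V) for a in actions)   # V*(s) = max_a [...]
            delta = max(delta, abs(new_v - V[s]))
            V[s] = new_v
        if delta < tol:
            break

    # pi*(s) = argmax_a Q*(s, a)
    policy = {s: max(actions, key=lambda a, s=s: backup(s, a, V)) for s in states}
    return V, policy
```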
5. Exploration vs. Exploitation
This is the fundamental dilemma in Reinforcement Learning.
- Exploitation: Making the best decision given current information. The agent uses what it already knows to maximize reward.
- Risk: The agent might get stuck in a local optimum and miss a bigger reward elsewhere.
- Exploration: Gathering more information. The agent tries new actions that might result in lower short-term rewards but could reveal higher long-term rewards.
- Risk: The agent loses immediate reward to learn about the environment.
Strategies to Manage the Trade-off
1. $\epsilon$-Greedy (Epsilon-Greedy)
- With probability $1 - \epsilon$: Choose the "greedy" action (the one with the highest estimated value).
- With probability $\epsilon$: Choose a random action (explore).
- Note: Often, $\epsilon$ is decayed over time (high exploration at the start, high exploitation later).
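A minimal $\epsilon$-greedy selector (the decay schedule in the comment is one common heuristic, not a prescribed choice):

```python
import random

def epsilon_greedy(q_values, epsilon):
    """q_values: dict mapping action -> estimated value; returns the chosen action."""
    if random.random() < epsilon:
        return random.choice(list(q_values))      # explore: random action
    return max(q_values, key=q_values.get)        # exploit: highest estimated value

# One common decay heuristic: epsilon = max(0.01, epsilon * 0.995) after each episode.
```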
2. Optimistic Initialization
Initialize all $Q$-values to a high number. The agent will be disappointed by initial rewards (which are lower than the estimate) and will switch actions rapidly, exploring the environment until the values converge to reality.
3. Upper Confidence Bound (UCB)
Select actions based on their potential. Actions that have not been tested often have a high "uncertainty bonus" added to their value estimate, encouraging the agent to try them.
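A sketch of UCB-style action selection; the exploration constant `c` and the bonus term $c\sqrt{\ln t / N(a)}$ follow the standard UCB1 form and are assumptions here:

```python
import math

def ucb_action(q_values, counts, t, c=2.0):
    """Pick the action maximizing Q(a) + c * sqrt(ln(t) / N(a)).

    q_values: dict action -> value estimate
    counts:   dict action -> times the action has been selected
    t:        total number of selections so far (t >= 1)
    c:        exploration constant (a tunable assumption)
    """
    def score(a):
        if counts[a] == 0:
            return float("inf")        # untried actions get the largest bonus
        return q_values[a] + c * math.sqrt(math.log(t) / counts[a])
    return max(q_values, key=score)
```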
6. Learning Methods
When the environment dynamics ($P$ and $R$) are unknown, the agent cannot use standard dynamic programming. It must learn from experience. These are Model-Free methods.
A. Monte Carlo (MC) Learning
MC methods learn directly from episodes of experience without bootstrapping.
- Requirement: Can only be used for episodic tasks (tasks that terminate).
- Method: The agent plays through an entire episode. At the end, it looks back and calculates the total return for every state visited.
- Update Rule: $V(s_t) \leftarrow V(s_t) + \alpha \left[ G_t - V(s_t) \right]$, where $\alpha$ is the learning rate.
- Characteristics:
- Unbiased: the target $G_t$ is the actual return.
- High Variance: Because the return depends on many random actions and transitions over a long episode.
- Must wait until the end of the episode to update.
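A sketch of an every-visit Monte Carlo update from one finished episode, using the incremental update rule above (the `(state, reward)` episode format is an assumption made for this example):

```python
from collections import defaultdict

def mc_update(V, episode, gamma, alpha):
    """Every-visit Monte Carlo update from one finished episode.

    episode: list of (state, reward) pairs, where reward is the reward
             received after leaving that state.
    """
    g = 0.0
    for state, reward in reversed(episode):      # walk backwards to accumulate G_t
        g = reward + gamma * g
        V[state] += alpha * (g - V[state])       # V(s) <- V(s) + alpha * (G_t - V(s))
    return V

V = defaultdict(float)   # value estimates default to 0
# V = mc_update(V, episode, gamma=0.99, alpha=0.1)   # call after each finished episode
```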
B. Temporal Difference (TD) Learning
TD learning is a combination of Monte Carlo ideas and Dynamic Programming (DP) ideas. Like MC, TD learns from experience without a model. Like DP, TD updates estimates based in part on other learned estimates (Bootstrapping).
- Requirement: Can be used for continuous or episodic tasks.
- Method: The agent updates its estimate after every single step, using the immediate reward and the estimate of the next state.
- TD(0) Update Rule: $V(s_t) \leftarrow V(s_t) + \alpha \left[ r_{t+1} + \gamma V(s_{t+1}) - V(s_t) \right]$
- Target: $r_{t+1} + \gamma V(s_{t+1})$ (the TD Target).
- TD Error: $\delta_t = r_{t+1} + \gamma V(s_{t+1}) - V(s_t)$.
- Characteristics:
- Biased: The update relies on $V(s_{t+1})$, which is a guess.
- Low Variance: Depends only on one step of randomness.
- Learns online (step-by-step), making it faster to converge in many scenarios.
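A sketch of a single TD(0) update, applied after every step rather than at the end of an episode:

```python
def td0_update(V, state, reward, next_state, alpha, gamma, done=False):
    """One TD(0) step: V(s_t) <- V(s_t) + alpha * [r_{t+1} + gamma * V(s_{t+1}) - V(s_t)]."""
    td_target = reward if done else reward + gamma * V[next_state]   # TD target
    td_error = td_target - V[state]                                  # delta_t
    V[state] += alpha * td_error
    return V
```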
Comparison: MC vs. TD
| Feature | Monte Carlo (MC) | Temporal Difference (TD) |
|---|---|---|
| Update Time | End of episode | Every time step |
| Bootstrapping | No (uses actual return $G_t$) | Yes (uses estimate $V(s_{t+1})$) |
| Bias/Variance | Zero Bias / High Variance | Some Bias / Low Variance |
| Environment | Episodic only | Episodic or Continuous |
| Analogy | Learning the duration of a commute by driving it 100 times and averaging. | Learning the commute duration by observing current traffic and updating the prediction while driving. |