Unit 4 - Notes
INT423
Unit 4: Foundations of Reinforcement Learning
1. What is Reinforcement Learning (RL)?
Reinforcement Learning is a subfield of Machine Learning in which an agent learns how to behave in an environment by performing actions and observing the results. Unlike supervised learning, where the training data comes with correct answers (labels), in RL the learner is not told which actions to take but must discover which actions yield the most reward by trying them.
Core Concepts and Terminology
- Agent: The entity that learns and makes decisions.
- Environment: The world through which the agent moves and responds.
- State ($s_t$): A representation of the environment's situation at a specific time step $t$.
- Action ($a_t$): The move the agent decides to make at time step $t$.
- Reward ($r_{t+1}$): A scalar feedback signal from the environment indicating how well the agent is doing at step $t$.
- Episode: A sequence of states, actions, and rewards from an initial state to a terminal state (e.g., a single game of chess).
The RL Loop
- The Agent observes state $s_t$.
- The Agent selects action $a_t$.
- The Environment receives action $a_t$ and transitions to a new state $s_{t+1}$.
- The Environment provides a reward $r_{t+1}$ to the Agent.
- The cycle repeats.
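A minimal sketch of this loop in Python, assuming a hypothetical Gym-style environment with `reset()` and `step()` methods and a placeholder agent that picks actions at random:

```python
import random

class RandomAgent:
    """Placeholder agent: a real agent would use a learned policy here."""
    def __init__(self, actions):
        self.actions = actions

    def select_action(self, state):
        return random.choice(self.actions)

def run_episode(env, agent):
    state = env.reset()                              # observe initial state s_0
    total_reward, done = 0.0, False
    while not done:
        action = agent.select_action(state)          # agent selects a_t
        next_state, reward, done = env.step(action)  # environment transitions and rewards
        total_reward += reward
        state = next_state                           # the cycle repeats
    return total_reward
```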
The Reward Hypothesis
All goals can be described by the maximization of the expected cumulative reward.
2. Markov Decision Process (MDP)
An MDP is the mathematical framework used to describe an environment in reinforcement learning. An environment is considered an MDP if it satisfies the Markov Property.
The Markov Property
A state $s_t$ is Markov if and only if:

$$P[s_{t+1} \mid s_t] = P[s_{t+1} \mid s_1, s_2, \dots, s_t]$$
This means the future depends only on the current state, not on the history of events that occurred before the current state. The current state captures all relevant information from the past.
Components of an MDP
An MDP is defined by a tuple $(S, A, P, R, \gamma)$:
- $S$ (State Space): A set of all valid states.
- $A$ (Action Space): A set of all valid actions.
- $P$ (Transition Probability): The dynamics of the system. $P(s' \mid s, a)$ is the probability of moving to state $s'$ given current state $s$ and action $a$.
- $R$ (Reward Function): The expected immediate reward, $R(s, a) = \mathbb{E}[r_{t+1} \mid s_t = s, a_t = a]$.
- $\gamma$ (Discount Factor): A number between $0$ and $1$.
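These components can be written down concretely. Below is a minimal sketch of a made-up two-state MDP encoded as plain Python dictionaries; the states, actions, and reward values are illustrative assumptions, not part of the notes.

```python
# A tiny illustrative MDP: two states, two actions.
states = ["A", "B"]
actions = ["stay", "move"]
gamma = 0.9   # discount factor

# P[(s, a)] -> {s': probability of transitioning to s'}
P = {
    ("A", "stay"): {"A": 1.0},
    ("A", "move"): {"B": 1.0},
    ("B", "stay"): {"B": 1.0},
    ("B", "move"): {"A": 1.0},
}

# R[(s, a)] -> expected immediate reward
R = {
    ("A", "stay"): 0.0,
    ("A", "move"): 1.0,
    ("B", "stay"): 0.0,
    ("B", "move"): 0.0,
}
```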
The Discount Factor ($\gamma$)
The discount factor $\gamma$ determines the present value of future rewards.
- $\gamma = 0$: The agent is "myopic" (cares only about immediate rewards).
- $\gamma$ close to $1$: The agent is "far-sighted" (cares about long-term returns).
Why discount?
- Mathematical convergence (prevents infinite returns in cyclic processes).
- Uncertainty about the future.
- Immediate rewards are generally preferred (financial interest).
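As a quick numeric illustration (assuming a reward of $1$ at every step, an assumption made only for this example), the discounted sum with $\gamma = 0.9$ is

$$1 + 0.9 + 0.9^2 + 0.9^3 + \dots = \frac{1}{1 - 0.9} = 10$$

With $\gamma = 0.5$ the same reward stream is worth only $2$, and with $\gamma = 0$ only the immediate reward of $1$ counts.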
3. Policies and Value Functions
Policy ($\pi$)
A policy is the agent's behavior function: a map from state to action. It defines how the agent behaves at a given time.
- Deterministic Policy: $a = \pi(s)$ maps a state directly to an action.
- Stochastic Policy: $\pi(a \mid s) = P[a_t = a \mid s_t = s]$ outputs a probability distribution over actions.
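A small sketch contrasting the two policy types (the states `s1`, `s2` and actions `left`, `right` are hypothetical):

```python
import random

# Deterministic policy: a direct state -> action mapping.
deterministic_policy = {"s1": "left", "s2": "right"}

# Stochastic policy: state -> probability distribution over actions.
stochastic_policy = {
    "s1": {"left": 0.8, "right": 0.2},
    "s2": {"left": 0.1, "right": 0.9},
}

def sample_action(policy, state):
    """Draw an action from a stochastic policy's distribution for this state."""
    dist = policy[state]
    return random.choices(list(dist), weights=list(dist.values()), k=1)[0]
```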
Return ($G_t$)
The goal of RL is to maximize the cumulative discounted return:

$$G_t = r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + \dots = \sum_{k=0}^{\infty} \gamma^k r_{t+k+1}$$
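A short helper that computes this return for a finished list of rewards (a sketch; the reward list and $\gamma$ value below are illustrative):

```python
def discounted_return(rewards, gamma):
    """Compute G_t = r_{t+1} + gamma * r_{t+2} + ... by working backwards."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g

# Example: three rewards of 1 with gamma = 0.9 -> 1 + 0.9 + 0.81 = 2.71
print(discounted_return([1, 1, 1], gamma=0.9))
```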
Value Functions
Value functions measure "how good" it is to be in a specific state (or state-action pair).
1. State-Value Function $V^\pi(s)$
The expected return starting from state $s$ and following policy $\pi$ thereafter:

$$V^\pi(s) = \mathbb{E}_\pi\left[ G_t \mid s_t = s \right]$$

2. Action-Value Function $Q^\pi(s, a)$
The expected return starting from state $s$, taking action $a$, and following policy $\pi$ thereafter. This is crucial for control (deciding which action is best):

$$Q^\pi(s, a) = \mathbb{E}_\pi\left[ G_t \mid s_t = s, a_t = a \right]$$
4. Bellman Equation
The Bellman Equation decomposes the value function into two parts:
- The immediate reward ($r_{t+1}$).
- The discounted value of the successor state ($\gamma V(s_{t+1})$).
This recursive definition allows us to solve for optimal policies using dynamic programming.
Bellman Expectation Equation (for $V^\pi$)

$$V^\pi(s) = \sum_{a} \pi(a \mid s) \left[ R(s, a) + \gamma \sum_{s'} P(s' \mid s, a) V^\pi(s') \right]$$

Interpretation: The value of state $s$ is the average of the immediate reward plus the discounted values of the next possible states $s'$, weighted by how likely we are to go there under the policy and the dynamics.
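A sketch of iterative policy evaluation, which applies the Bellman expectation equation repeatedly until $V^\pi$ stops changing; it assumes the dictionary-style `states`, `P`, `R`, and stochastic `policy` formats used in the earlier sketches:

```python
def policy_evaluation(states, P, R, policy, gamma, tol=1e-8):
    """Repeatedly apply the Bellman expectation equation until V stops changing."""
    V = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            # V(s) = sum_a pi(a|s) * [ R(s,a) + gamma * sum_s' P(s'|s,a) * V(s') ]
            new_v = sum(
                p_a * (R[(s, a)] + gamma * sum(p * V[s2] for s2, p in P[(s, a)].items()))
                for a, p_a in policy[s].items()
            )
            delta = max(delta, abs(new_v - V[s]))
            V[s] = new_v
        if delta < tol:
            return V
```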
Bellman Optimality Equation
To find the best possible behavior, we look for the Optimal Value Function ($V^*$ and $Q^*$) and Optimal Policy ($\pi^*$):

$$V^*(s) = \max_a \left[ R(s, a) + \gamma \sum_{s'} P(s' \mid s, a) V^*(s') \right]$$

$$Q^*(s, a) = R(s, a) + \gamma \sum_{s'} P(s' \mid s, a) \max_{a'} Q^*(s', a')$$

If an agent knows $Q^*$, the optimal policy is simply to take the action with the highest $Q$-value:

$$\pi^*(s) = \arg\max_{a} Q^*(s, a)$$
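A sketch of value iteration, which applies the Bellman optimality equation until $V^*$ converges and then extracts the greedy policy; the dictionary-based MDP format is again an assumption carried over from the earlier sketch:

```python
def value_iteration(states, actions, P, R, gamma, tol=1e-8):
    """Apply the Bellman optimality equation until V* converges, then extract pi*."""
    def backup(s, a, V):
        # One-step lookahead: R(s,a) + gamma * sum_s' P(s'|s,a) * V(s')
        return R[(s, a)] + gamma * sum(p * V[s2] for s2, p in P[(s, a)].items())

    V = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            new_v = max(backup(s, a, V) for a in actions)   # V*(s) = max_a [...]
            delta = max(delta, abs(new_v - V[s]))
            V[s] = new_v
        if delta < tol:
            break

    # pi*(s) = argmax_a Q*(s, a)
    policy = {s: max(actions, key=lambda a, s=s: backup(s, a, V)) for s in states}
    return V, policy
```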
5. Exploration vs. Exploitation
This is the fundamental dilemma in Reinforcement Learning.
- Exploitation: Making the best decision given current information. The agent uses what it already knows to maximize reward.
- Risk: The agent might get stuck in a local optimum and miss a bigger reward elsewhere.
- Exploration: Gathering more information. The agent tries new actions that might result in lower short-term rewards but could reveal higher long-term rewards.
- Risk: The agent loses immediate reward to learn about the environment.
Strategies to Manage the Trade-off
1. $\epsilon$-Greedy (Epsilon-Greedy)
- With probability $1 - \epsilon$: Choose the "greedy" action (the one with the highest estimated value).
- With probability $\epsilon$: Choose a random action (explore).
- Note: Often, $\epsilon$ is decayed over time (high exploration at the start, high exploitation later).
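A minimal $\epsilon$-greedy selector (the decay schedule in the comment is one common heuristic, not a prescribed choice):

```python
import random

def epsilon_greedy(q_values, epsilon):
    """q_values: dict mapping action -> estimated value; returns the chosen action."""
    if random.random() < epsilon:
        return random.choice(list(q_values))      # explore: random action
    return max(q_values, key=q_values.get)        # exploit: highest estimated value

# One common decay heuristic: epsilon = max(0.01, epsilon * 0.995) after each episode.
```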
2. Optimistic Initialization
Initialize all $Q$-values to a high number. The agent will be disappointed by initial rewards (which are lower than the estimate) and will switch actions rapidly, exploring the environment until the values converge to reality.
3. Upper Confidence Bound (UCB)
Select actions based on their potential. Actions that have not been tested often have a high "uncertainty bonus" added to their value estimate, encouraging the agent to try them.
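A sketch of UCB-style action selection; the exploration constant `c` and the bonus term $c\sqrt{\ln t / N(a)}$ follow the standard UCB1 form and are assumptions here:

```python
import math

def ucb_action(q_values, counts, t, c=2.0):
    """Pick the action maximizing Q(a) + c * sqrt(ln(t) / N(a)).

    q_values: dict action -> value estimate
    counts:   dict action -> times the action has been selected
    t:        total number of selections so far (t >= 1)
    c:        exploration constant (a tunable assumption)
    """
    def score(a):
        if counts[a] == 0:
            return float("inf")        # untried actions get the largest bonus
        return q_values[a] + c * math.sqrt(math.log(t) / counts[a])
    return max(q_values, key=score)
```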
6. Learning Methods
When the environment dynamics ($P$ and $R$) are unknown, the agent cannot use standard dynamic programming. It must learn from experience. These are Model-Free methods.
A. Monte Carlo (MC) Learning
MC methods learn directly from episodes of experience without bootstrapping.
- Requirement: Can only be used for episodic tasks (tasks that terminate).
- Method: The agent plays through an entire episode. At the end, it looks back and calculates the total return for every state visited.
- Update Rule: $V(s_t) \leftarrow V(s_t) + \alpha \left[ G_t - V(s_t) \right]$, where $\alpha$ is the learning rate.
- Characteristics:
- Unbiased: the target $G_t$ is the actual return.
- High Variance: Because the return depends on many random actions and transitions over a long episode.
- Must wait until the end of the episode to update.
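A sketch of an every-visit Monte Carlo update from one finished episode, using the incremental update rule above (the `(state, reward)` episode format is an assumption made for this example):

```python
from collections import defaultdict

def mc_update(V, episode, gamma, alpha):
    """Every-visit Monte Carlo update from one finished episode.

    episode: list of (state, reward) pairs, where reward is the reward
             received after leaving that state.
    """
    g = 0.0
    for state, reward in reversed(episode):      # walk backwards to accumulate G_t
        g = reward + gamma * g
        V[state] += alpha * (g - V[state])       # V(s) <- V(s) + alpha * (G_t - V(s))
    return V

V = defaultdict(float)   # value estimates default to 0
# V = mc_update(V, episode, gamma=0.99, alpha=0.1)   # call after each finished episode
```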
B. Temporal Difference (TD) Learning
TD learning is a combination of Monte Carlo ideas and Dynamic Programming (DP) ideas. Like MC, TD learns from experience without a model. Like DP, TD updates estimates based in part on other learned estimates (Bootstrapping).
- Requirement: Can be used for continuous or episodic tasks.
- Method: The agent updates its estimate after every single step, using the immediate reward and the estimate of the next state.
- TD(0) Update Rule: $V(s_t) \leftarrow V(s_t) + \alpha \left[ r_{t+1} + \gamma V(s_{t+1}) - V(s_t) \right]$
- Target: $r_{t+1} + \gamma V(s_{t+1})$ (the TD Target).
- TD Error: $\delta_t = r_{t+1} + \gamma V(s_{t+1}) - V(s_t)$.
- Characteristics:
- Biased: The update relies on $V(s_{t+1})$, which is a guess.
- Low Variance: Depends only on one step of randomness.
- Learns online (step-by-step), making it faster to converge in many scenarios.
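A sketch of a single TD(0) update, applied after every step rather than at the end of an episode:

```python
def td0_update(V, state, reward, next_state, alpha, gamma, done=False):
    """One TD(0) step: V(s_t) <- V(s_t) + alpha * [r_{t+1} + gamma * V(s_{t+1}) - V(s_t)]."""
    td_target = reward if done else reward + gamma * V[next_state]   # TD target
    td_error = td_target - V[state]                                  # delta_t
    V[state] += alpha * td_error
    return V
```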
Comparison: MC vs. TD
| Feature | Monte Carlo (MC) | Temporal Difference (TD) |
|---|---|---|
| Update Time | End of episode | Every time step |
| Bootstrapping | No (uses actual return $G_t$) | Yes (uses estimate $V(s_{t+1})$) |
| Bias/Variance | Zero Bias / High Variance | Some Bias / Low Variance |
| Environment | Episodic only | Episodic or Continuous |
| Analogy | Learning the duration of a commute by driving it 100 times and averaging. | Learning the commute duration by observing current traffic and updating the prediction while driving. |