Unit 5 - Notes
INT394
Unit 5: Reinforcement Learning
1. Fundamentals of Reinforcement Learning
Reinforcement Learning (RL) is a subfield of machine learning concerned with how an intelligent agent should take actions in an environment in order to maximize cumulative reward. Unlike supervised learning, RL does not rely on labeled data; instead, the agent learns through trial and error.
1.1 Core Characteristics
- No Supervisor: The agent is not told which action to take; it must discover good actions by trying them.
- Reward Signal: The only feedback is a scalar number (reward), which may be delayed.
- Sequential Decision Making: Time matters. Decisions affect not only the immediate reward but also the next state and, through that, all subsequent rewards.
- Feedback Loop: The agent’s actions influence the subsequent data it receives.
1.2 Exploration vs. Exploitation Dilemma
A fundamental challenge in RL is balancing two competing objectives:
- Exploration: The agent tries new actions to discover their rewards. This is necessary to find better strategies but risks lower short-term rewards.
- Exploitation: The agent uses its current knowledge to select the action believed to yield the highest reward. This maximizes short-term performance but can prevent the agent from discovering better long-term strategies. A common way to balance the two, epsilon-greedy action selection, is sketched below.
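The following sketch shows epsilon-greedy selection. It assumes a Q-table `q_values` stored as a dict keyed by `(state, action)` pairs; the table layout and the default `epsilon` are illustrative assumptions, not a fixed convention.

```python
import random

def epsilon_greedy(q_values, state, n_actions, epsilon=0.1):
    """With probability epsilon pick a random action (explore);
    otherwise pick the action with the highest estimated value (exploit)."""
    if random.random() < epsilon:
        return random.randrange(n_actions)                                # explore
    return max(range(n_actions), key=lambda a: q_values[(state, a)])      # exploit
```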
1.3 Comparison with Other ML Types
| Feature | Supervised Learning | Unsupervised Learning | Reinforcement Learning |
|---|---|---|---|
| Data | Labeled pairs $(x, y)$ | Unlabeled data $x$ | State-Action-Reward interaction |
| Goal | Minimize error | Find hidden structures | Maximize cumulative reward |
| Feedback | Instant and correct | None | Delayed and evaluative |
2. Components of Reinforcement Learning
The RL process is modeled as a loop of interaction between an Agent and an Environment.
2.1 The Major Elements
- Agent: The learner or decision-maker.
- Environment: The physical or virtual world outside the agent.
- State ($S_t$): A representation of the current situation of the environment at time step $t$.
- Action ($A_t$): The move or decision made by the agent at time step $t$.
- Reward ($R_{t+1}$): A numerical value received from the environment after an action is taken. It indicates the immediate success of an action.
2.2 The Interaction Loop
- The agent observes state $S_t$.
- The agent selects action $A_t$.
- The environment responds with a new state $S_{t+1}$ and a reward $R_{t+1}$.
- The cycle repeats.
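The loop can be made concrete in a few lines of code. The sketch below assumes the Gymnasium library and its `CartPole-v1` environment are available, and uses a random policy in place of a learning agent:

```python
import gymnasium as gym

env = gym.make("CartPole-v1")
state, info = env.reset()                  # the agent observes the initial state

done = False
total_reward = 0.0
while not done:
    action = env.action_space.sample()     # the agent selects an action (random here)
    state, reward, terminated, truncated, info = env.step(action)  # environment responds
    total_reward += reward
    done = terminated or truncated         # the cycle repeats until a terminal state

print("Return of this episode:", total_reward)
```

Replacing `env.action_space.sample()` with a learned policy turns this skeleton into an actual RL agent.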
2.3 Policy ($\pi$)
The policy is the agent's behavior strategy. It is a mapping from state to action.
- Deterministic Policy: $a = \pi(s)$ (always takes the same action in a specific state).
- Stochastic Policy: $\pi(a \mid s) = P[A_t = a \mid S_t = s]$ (a probability distribution over actions in each state).
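To make the distinction concrete, the toy snippet below represents both kinds of policy over made-up states (`s0`, `s1`) and actions (`left`, `right`); all names and probabilities are illustrative:

```python
import random

# Deterministic policy: a fixed mapping from state to action.
deterministic_policy = {"s0": "left", "s1": "right"}

def act_deterministic(state):
    return deterministic_policy[state]                       # a = pi(s)

# Stochastic policy: a probability distribution over actions for each state.
stochastic_policy = {
    "s0": {"left": 0.8, "right": 0.2},
    "s1": {"left": 0.5, "right": 0.5},
}

def act_stochastic(state):
    actions = list(stochastic_policy[state])
    probs = list(stochastic_policy[state].values())
    return random.choices(actions, weights=probs, k=1)[0]    # sample a ~ pi(. | s)
```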
2.4 Model of the Environment
The model is the agent’s representation of how the environment works.
- Transition Probability: Predicts the next state.
- Reward Function: Predicts the immediate reward.
- Model-Based RL: The agent learns a model and plans using it.
- Model-Free RL: The agent learns a policy or value function directly from experience without learning the physics of the environment.
3. Markov Decision Processes (MDPs)
An MDP is a discrete-time stochastic control process. It provides the mathematical framework for modeling decision-making in situations where outcomes are partly random and partly under the control of a decision-maker.
3.1 The Markov Property
A state $S_t$ is Markov if and only if:
$$P[S_{t+1} \mid S_t] = P[S_{t+1} \mid S_1, S_2, \dots, S_t]$$
Translation: The future depends only on the present, not on the history. Once the current state is known, the history becomes irrelevant.
3.2 MDP Tuple
An MDP is defined by a 5-tuple $(S, A, P, R, \gamma)$:
- $S$: A finite set of states.
- $A$: A finite set of actions.
- $P$: State transition probability matrix, $P_{ss'}^a = P[S_{t+1} = s' \mid S_t = s, A_t = a]$.
- $R$: Reward function, $R_s^a = E[R_{t+1} \mid S_t = s, A_t = a]$.
- $\gamma$ (Gamma): A discount factor, $\gamma \in [0, 1]$.
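As a concrete (made-up) illustration, a tiny two-state MDP can be written out explicitly as Python dictionaries:

```python
# A toy 2-state, 2-action MDP; all numbers are invented for illustration.
S = ["s0", "s1"]
A = ["stay", "move"]

# P[(s, a)] gives the distribution over next states s'.
P = {
    ("s0", "stay"): {"s0": 1.0},
    ("s0", "move"): {"s0": 0.2, "s1": 0.8},
    ("s1", "stay"): {"s1": 1.0},
    ("s1", "move"): {"s0": 0.9, "s1": 0.1},
}

# R[(s, a)] gives the expected immediate reward.
R = {
    ("s0", "stay"): 0.0,
    ("s0", "move"): 1.0,
    ("s1", "stay"): 2.0,
    ("s1", "move"): 0.0,
}

gamma = 0.9   # discount factor
```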
3.3 The Goal: Cumulative Reward (Return)
The agent seeks to maximize the expected return ($G_t$). The return is the total discounted reward from time-step $t$:
$$G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \dots = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1}$$
Why Discounting ($\gamma$)?
- Mathematical Convergence: Ensures the sum is finite for infinite horizons.
- Uncertainty: Future rewards are less certain than immediate ones.
- Value: $\gamma = 0$ leads to "myopic" evaluation (cares only about immediate reward). $\gamma$ close to 1 leads to "far-sighted" evaluation.
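The effect of $\gamma$ can be seen by computing the return for a short, made-up reward sequence:

```python
def discounted_return(rewards, gamma):
    """G_t = sum_k gamma^k * R_{t+k+1} for a finite reward sequence."""
    return sum((gamma ** k) * r for k, r in enumerate(rewards))

rewards = [1.0, 1.0, 1.0, 1.0]                 # illustrative reward sequence
print(discounted_return(rewards, gamma=0.0))   # 1.0   -> myopic: only the first reward counts
print(discounted_return(rewards, gamma=0.9))   # 3.439 -> far-sighted: later rewards still matter
```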
4. Value Function and Bellman Equations
Value functions estimate "how good" it is for an agent to be in a given state (or to perform a given action in a given state).
4.1 State-Value Function
The expected return starting from state $s$ and following policy $\pi$:
$$v_\pi(s) = E_\pi[G_t \mid S_t = s]$$
4.2 Action-Value Function
The expected return starting from state $s$, taking action $a$, and then following policy $\pi$:
$$q_\pi(s, a) = E_\pi[G_t \mid S_t = s, A_t = a]$$
4.3 Bellman Expectation Equations
The Bellman equation decomposes the value function into two parts: the immediate reward plus the discounted value of the successor state.
For $v_\pi$:
$$v_\pi(s) = E_\pi[R_{t+1} + \gamma v_\pi(S_{t+1}) \mid S_t = s]$$
For $q_\pi$:
$$q_\pi(s, a) = E_\pi[R_{t+1} + \gamma q_\pi(S_{t+1}, A_{t+1}) \mid S_t = s, A_t = a]$$
4.4 Bellman Optimality Equations
The goal of RL is to find the optimal policy $\pi_*$. The optimal value functions are denoted $v_*(s)$ and $q_*(s, a)$.
Optimal State-Value:
$$v_*(s) = \max_a \left( R_s^a + \gamma \sum_{s'} P_{ss'}^a \, v_*(s') \right)$$
Optimal Action-Value:
$$q_*(s, a) = R_s^a + \gamma \sum_{s'} P_{ss'}^a \max_{a'} q_*(s', a')$$
Significance: If we know $q_*$, the optimal policy is simply to take, in each state, the action that maximizes $q_*(s, a)$ (Greedy Policy).
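In code, acting greedily with respect to an action-value table is a per-state argmax. The sketch below assumes `q` is a dict mapping `(state, action)` pairs to estimated values:

```python
def greedy_policy(q, states, actions):
    """Return the greedy (deterministic) policy implied by the Q-table q."""
    return {s: max(actions, key=lambda a: q[(s, a)]) for s in states}
```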
5. Temporal Difference (TD) Learning
TD Learning is a central concept in modern RL. It combines ideas from Monte Carlo methods and Dynamic Programming.
5.1 Concept
- Model-Free: Like Monte Carlo, TD learns directly from raw experience without a model of the environment dynamics.
- Bootstrapping: Like Dynamic Programming, TD updates estimates based in part on other learned estimates, without waiting for a final outcome (episode termination).
5.2 TD(0) Update Rule
The simplest form of TD learning estimates $v_\pi$. After each step, the value estimate is updated as:
$$V(S_t) \leftarrow V(S_t) + \alpha \left[ R_{t+1} + \gamma V(S_{t+1}) - V(S_t) \right]$$
Where:
- $V(S_t)$: Current estimate of the value of state $S_t$.
- $\alpha$: Learning rate (step size).
- $R_{t+1} + \gamma V(S_{t+1})$: The TD Target (estimated return).
- $\delta_t = R_{t+1} + \gamma V(S_{t+1}) - V(S_t)$: The TD Error.
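A single TD(0) update fits in a few lines; `V` is assumed to be a dict mapping each state to its current value estimate, and the `alpha`/`gamma` defaults are illustrative:

```python
def td0_update(V, state, reward, next_state, alpha=0.1, gamma=0.99):
    """Apply one TD(0) update to the value table V and return the TD error."""
    td_target = reward + gamma * V[next_state]   # TD target: estimated return
    td_error = td_target - V[state]              # delta_t
    V[state] += alpha * td_error
    return td_error
```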
5.3 Advantages
- Can learn online after every step (unlike Monte Carlo which waits for end of episode).
- Usually converges faster than Monte Carlo.
- Does not require a model of $P$ (transitions) or $R$ (rewards).
6. Q-Learning
Q-Learning is one of the most popular model-free RL algorithms. It is an Off-Policy TD control algorithm.
6.1 Off-Policy vs. On-Policy
- On-Policy (e.g., SARSA): Attempts to evaluate or improve the policy that is currently being used to make decisions.
- Off-Policy (Q-Learning): Evaluates or improves a policy different from the one used to generate the data. It learns the value of the optimal policy regardless of the agent's actual actions (as long as it explores enough).
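The difference shows up directly in the bootstrap target of the update (the full Q-learning rule is given in the next subsection). A rough sketch with dict-based Q-tables:

```python
# SARSA (on-policy): bootstraps from the action a_next actually chosen by the behavior policy.
def sarsa_update(Q, s, a, r, s_next, a_next, alpha, gamma):
    Q[(s, a)] += alpha * (r + gamma * Q[(s_next, a_next)] - Q[(s, a)])

# Q-learning (off-policy): bootstraps from the best action in s_next, whatever is actually taken.
def q_learning_update(Q, s, a, r, s_next, actions, alpha, gamma):
    best_next = max(Q[(s_next, a2)] for a2 in actions)
    Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
```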
6.2 The Q-Learning Equation
The core of the algorithm is the update rule for the Q-value (Action-Value):
$$Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha \left[ R_{t+1} + \gamma \max_a Q(S_{t+1}, a) - Q(S_t, A_t) \right]$$
Breakdown:
- $Q(S_t, A_t)$: Old Q-value.
- $\alpha$: Learning rate.
- $R_{t+1}$: Immediate reward.
- $\gamma \max_a Q(S_{t+1}, a)$: Estimate of optimal future value. Note: we take the max over all actions in the next state, assuming we will act optimally from there, even if we are currently exploring.
6.3 The Algorithm
Initialize Q(s, a) arbitrarily
Repeat (for each episode):
    Initialize S
    Repeat (for each step of episode):
        Choose A from S using policy derived from Q (e.g., epsilon-greedy)
        Take action A, observe R, S'
        # The Update
        Q(S, A) = Q(S, A) + alpha * (R + gamma * max_a Q(S', a) - Q(S, A))
        S = S'
    Until S is terminal
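The pseudocode above translates almost line by line into Python. The sketch below assumes the Gymnasium library and its `FrozenLake-v1` environment are available; the hyperparameter values are illustrative, not tuned:

```python
import numpy as np
import gymnasium as gym

env = gym.make("FrozenLake-v1", is_slippery=False)
n_states = env.observation_space.n
n_actions = env.action_space.n

Q = np.zeros((n_states, n_actions))        # Initialize Q(s, a) arbitrarily (here: zeros)
alpha, gamma, epsilon = 0.1, 0.99, 0.1     # illustrative hyperparameters

for episode in range(5000):
    state, info = env.reset()              # Initialize S
    done = False
    while not done:
        # Choose A from S using an epsilon-greedy policy derived from Q
        if np.random.rand() < epsilon:
            action = env.action_space.sample()
        else:
            action = int(np.argmax(Q[state]))
        # Take action A, observe R, S'
        next_state, reward, terminated, truncated, info = env.step(action)
        # The off-policy update: bootstrap from the best action in S'
        Q[state, action] += alpha * (reward + gamma * np.max(Q[next_state]) - Q[state, action])
        state = next_state                 # S <- S'
        done = terminated or truncated     # until S is terminal

print("Greedy policy:", np.argmax(Q, axis=1))
```

Because Q is initialized to zeros and terminal states are never updated, `np.max(Q[next_state])` is zero at episode ends, so the update there reduces to the immediate reward, as the return definition requires.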
6.4 Convergence
Q-learning is proven to converge to the optimal action-value function with probability 1, provided that all state-action pairs are visited infinitely often and the step-size parameter decays appropriately.