1. What is the primary goal of an agent in Reinforcement Learning?
A. To minimize the reconstruction error of the input data
B. To maximize the cumulative reward over time
C. To classify data into distinct categories based on labeled examples
D. To find hidden structures in unlabeled data
Correct Answer: To maximize the cumulative reward over time
Explanation:
In Reinforcement Learning, the agent interacts with an environment and attempts to learn a policy that maximizes the total amount of reward it receives over the long run.
2. Which of the following tuple representations correctly defines a Markov Decision Process (MDP)?
A. (S, A)
B. (S, A, P, R, γ)
C. (S, P, R)
D. (S, A, R)
Correct Answer: (S, A, P, R, γ)
Explanation:
An MDP is typically defined by a tuple (S, A, P, R, γ), where S is the state space, A is the action space, P is the transition probability function, R is the reward function, and γ is the discount factor.
3. What does the Markov Property imply about the state of an environment?
A. The current state provides no information about the future
B. The future depends on the entire history of past states
C. The future is independent of the past given the present
D. The transition probabilities change over time
Correct Answer: The future is independent of the past given the present
Explanation:
The Markov Property states that the future state depends only on the current state and action, not on the sequence of events that preceded it. Mathematically: P(S_{t+1} | S_t) = P(S_{t+1} | S_1, S_2, ..., S_t).
4. In the context of RL, what does the discount factor γ (gamma) control?
A. The learning rate of the agent
B. The probability of transitioning to a random state
C. The importance of immediate rewards versus future rewards
D. The exploration rate of the agent
Correct Answer: The importance of immediate rewards versus future rewards
Explanation:
The discount factor γ (where 0 ≤ γ ≤ 1) determines the present value of future rewards. A γ close to 0 makes the agent myopic (caring only about immediate rewards), while a γ close to 1 makes the agent far-sighted.
5. What distinguishes Reinforcement Learning from Supervised Learning?
A. RL is only used for continuous value prediction
B. RL maps inputs to outputs without any feedback
C. RL learns from interaction and delayed feedback (rewards) rather than explicit labels
D. RL relies on a static dataset with labeled targets
Correct Answer: RL learns from interaction and delayed feedback (rewards) rather than explicit labels
Explanation:
Unlike Supervised Learning, which is instructed with the correct answer (labels), RL discovers which actions yield the most reward by trying them (trial and error).
6. What is a Policy (π) in Reinforcement Learning?
A. The mechanism that provides rewards to the agent
B. The numerical value indicating the goodness of a state
C. A function that predicts the next state given the current state
D. A mapping from states to actions (or probabilities of actions)
Correct Answer: A mapping from states to actions (or probabilities of actions)
Explanation:
A policy π defines the agent's behavior. It maps a given state to an action (deterministic) or to a probability distribution over actions (stochastic).
7. Which equation represents the total discounted return G_t?
A. G_t = R_{t+1} + R_{t+2} + R_{t+3} + ...
B. G_t = R_{t+1} + γ R_{t+2} + γ^2 R_{t+3} + ... = Σ_{k=0}^∞ γ^k R_{t+k+1}
C. G_t = γ R_{t+1}
D. G_t = max_k R_{t+k}
Correct Answer: G_t = Σ_{k=0}^∞ γ^k R_{t+k+1}
Explanation:
The return G_t is the sum of discounted future rewards. γ^k discounts the reward received k steps into the future.
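The return above can be computed with a short backward recursion, G_t = R_{t+1} + γ·G_{t+1}. A minimal sketch (the function name and sample rewards are illustrative, not from the quiz):

```python
def discounted_return(rewards, gamma):
    """G = rewards[0] + gamma*rewards[1] + gamma^2*rewards[2] + ..."""
    g = 0.0
    for r in reversed(rewards):  # backward recursion: G_t = R + gamma * G_{t+1}
        g = r + gamma * g
    return g

# With gamma = 0.9: 1 + 0.9 + 0.81 = 2.71
total = discounted_return([1.0, 1.0, 1.0], 0.9)
# With gamma = 0 the agent is myopic: only the first reward counts.
myopic = discounted_return([5.0, 100.0], 0.0)
```

The gamma = 0 case also illustrates the "myopic agent" behavior described in question 4.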
8. What does the State-Value Function V^π(s) represent?
A. The probability of moving to state s
B. The expected return starting from state s and following policy π
C. The maximum reward possible in the entire environment
D. The immediate reward received at state s
Correct Answer: The expected return starting from state s and following policy π
Explanation:
V^π(s) = E_π[G_t | S_t = s]. It tells us how good it is to be in a specific state under a given policy.
9. What is the Action-Value Function Q^π(s, a)?
A. The value of taking action a in state s and then following policy π
B. The probability of taking action a in state s
C. The reward received immediately after taking action a
D. The value of being in state s regardless of the action taken
Correct Answer: The value of taking action a in state s and then following policy π
Explanation:
Q^π(s, a) = E_π[G_t | S_t = s, A_t = a]. It evaluates the expected return of taking a specific action in a specific state and following the policy thereafter.
10. The Bellman Equation expresses a relationship between:
A. The policy and the reward function only
B. The learning rate and the discount factor
C. The value of a state and the value of its successor states
D. The current observation and the previous observation
Correct Answer: The value of a state and the value of its successor states
Explanation:
The Bellman Equation decomposes the value function into two parts: the immediate reward and the discounted value of the successor state(s). It provides a recursive definition.
11. In the Bellman Optimality Equation, which operator is used to define the optimal value?
A. Average
B. Min (Minimization over costs)
C. Max (Maximization over actions)
D. Summation over time
Correct Answer: Max (Maximization over actions)
Explanation:
The optimal value function assumes the agent always selects the action that maximizes the expected return. Thus, it involves max_a.
12. What is the Exploration vs. Exploitation trade-off?
A. Deciding whether to use a neural network or a tabular method
B. Balancing between gathering new information and using known information to maximize reward
C. Choosing between model-based and model-free learning
D. Trading off computation time for memory usage
Correct Answer: Balancing between gathering new information and using known information to maximize reward
Explanation:
Exploration involves trying new actions to discover their rewards, while exploitation involves choosing the action currently known to yield the highest reward.
13. Which method is commonly used to balance exploration and exploitation?
A. ε-greedy (Epsilon-greedy)
B. Gradient Descent
C. Backpropagation
D. Principal Component Analysis
Correct Answer: ε-greedy (Epsilon-greedy)
Explanation:
In ε-greedy, the agent selects a random action with probability ε (exploration) and the best-known action with probability 1 - ε (exploitation).
14. What does it mean for an RL algorithm to be Model-Free?
A. It builds an explicit model of the environment's transition dynamics
B. It does not require knowledge of the transition probability or reward function
C. It cannot solve MDPs
D. It does not use any value functions
Correct Answer: It does not require knowledge of the transition probability or reward function
Explanation:
Model-free algorithms (like Q-Learning and TD Learning) learn directly from experience (samples of state, action, reward) without needing the environment's internal dynamics (model).
15. What is Temporal Difference (TD) Learning?
A. A method that waits until the end of an episode to update values
B. A method that updates estimates based on other learned estimates without waiting for the outcome
C. A supervised learning technique applied to RL
D. A method that requires a complete model of the environment
Correct Answer: A method that updates estimates based on other learned estimates without waiting for the outcome
Explanation:
TD learning combines Monte Carlo ideas (learning from experience) and Dynamic Programming ideas (bootstrapping). It updates the current value estimate towards a target that includes the estimated value of the next state.
16. Which of the following is the TD(0) update rule for the state-value function V(s)?
A. V(S_t) ← V(S_t) + α [G_t - V(S_t)]
B. V(S_t) ← V(S_t) + α [R_{t+1} + γ V(S_{t+1}) - V(S_t)]
C. V(S_t) ← R_{t+1}
D. V(S_t) ← max_a Q(S_t, a)
Correct Answer: V(S_t) ← V(S_t) + α [R_{t+1} + γ V(S_{t+1}) - V(S_t)]
Explanation:
This is the standard TD(0) update, where R_{t+1} + γ V(S_{t+1}) is the TD target and the term in brackets is the TD error.
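The TD(0) rule translates directly into code. A minimal sketch over a dictionary-based value table (state names and numbers are made-up illustrations):

```python
def td0_update(V, s, r, s_next, alpha, gamma):
    """V(s) <- V(s) + alpha * [r + gamma * V(s') - V(s)]; returns the TD error."""
    td_error = r + gamma * V[s_next] - V[s]  # TD target minus current estimate
    V[s] += alpha * td_error
    return td_error

V = {"s1": 0.0, "s2": 1.0}  # hypothetical value table
delta = td0_update(V, "s1", 1.0, "s2", alpha=0.5, gamma=0.9)
# target = 1.0 + 0.9*1.0 = 1.9, so V["s1"] moves halfway there: 0.95
```

Returning the TD error is a common convenience, since the same quantity δ reappears in question 28.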
17. What is Bootstrapping in the context of TD learning?
A. Resampling the dataset to create more training data
B. Updating a value estimate using another estimated value
C. Restarting the episode when the agent gets stuck
D. Initializing weights to zero
Correct Answer: Updating a value estimate using another estimated value
Explanation:
Bootstrapping refers to updating an estimate based on other estimates (e.g., using V(S_{t+1}) to update V(S_t)) rather than waiting for the actual final return.
18. Q-Learning is considered an Off-Policy algorithm. What does this mean?
A. It must follow the exact policy it is trying to learn
B. It requires the environment to be turned off during updates
C. It does not use a policy at all
D. It learns the value of the optimal policy while following a different exploratory policy
Correct Answer: It learns the value of the optimal policy while following a different exploratory policy
Explanation:
Q-Learning approximates Q* (the optimal action-value function) directly, regardless of the policy being followed to generate data (e.g., an ε-greedy policy).
19. Which represents the Q-Learning update equation?
A. Q(s, a) ← Q(s, a) + α [r + γ Q(s', a') - Q(s, a)]
B. Q(s, a) ← Q(s, a) + α [r + γ max_{a'} Q(s', a') - Q(s, a)]
C. Q(s, a) ← r + γ V(s')
D. Q(s, a) ← Q(s, a) + α [G_t - Q(s, a)]
Correct Answer: Q(s, a) ← Q(s, a) + α [r + γ max_{a'} Q(s', a') - Q(s, a)]
Explanation:
Q-Learning uses the maximum estimated value of the next state (max_{a'} Q(s', a')) as the target, aiming directly for optimality.
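The Q-Learning update over a plain dict Q-table can be sketched as follows (state names, actions, and numbers are hypothetical):

```python
def q_learning_update(Q, s, a, r, s_next, actions, alpha, gamma):
    """Q(s,a) <- Q(s,a) + alpha * [r + gamma * max_a' Q(s',a') - Q(s,a)]."""
    best_next = max(Q[(s_next, b)] for b in actions)  # off-policy max over actions
    Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])

Q = {("s0", 0): 0.0, ("s0", 1): 0.0, ("s1", 0): 2.0, ("s1", 1): 0.0}
q_learning_update(Q, "s0", 1, 1.0, "s1", [0, 1], alpha=0.5, gamma=0.9)
# target = 1.0 + 0.9 * 2.0 = 2.8, so Q[("s0", 1)] becomes 0.5 * 2.8 = 1.4
```

Note that the max is taken over the next state's actions regardless of which action the behavior policy actually takes there; that is exactly what makes the algorithm off-policy (question 18).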
20. In the Q-Learning update rule, what is α?
A. Discount factor
B. Learning rate
C. Exploration probability
D. Reward function
Correct Answer: Learning rate
Explanation:
α (alpha) is the learning rate (step size), determining how much newly acquired information overrides old information.
21. If γ = 0, the agent is:
A. Optimal
B. Myopic (short-sighted)
C. Random
D. Infinitely far-sighted
Correct Answer: Myopic (short-sighted)
Explanation:
When the discount factor is 0, future rewards are multiplied by 0. The agent only cares about maximizing the immediate reward R_{t+1}.
22. What is the key difference between Monte Carlo (MC) methods and TD Learning?
A. TD requires a model of the environment
B. MC is biased while TD is unbiased
C. MC can only be used for continuous states
D. MC updates are performed only after a complete episode, while TD updates can happen at every step
Correct Answer: MC updates are performed only after a complete episode, while TD updates can happen at every step
Explanation:
Monte Carlo methods wait for the return to be known (end of episode), whereas TD methods bootstrap and update online.
23. Which of the following best describes the Credit Assignment Problem in RL?
A. Assigning memory to store the Q-table
B. Determining which past action is responsible for a current reward
C. Calculating the computational cost of the algorithm
D. Distributing rewards among multiple agents
Correct Answer: Determining which past action is responsible for a current reward
Explanation:
Because rewards can be delayed, it is difficult to determine exactly which action in a sequence of actions led to a specific positive or negative outcome.
24. In a tabular Q-learning approach, the Q-table has dimensions of:
A. Number of Actions × Number of Rewards
B. Number of States × Number of Actions
C. Number of Episodes × Time Steps
D. Number of States × Number of States
Correct Answer: Number of States × Number of Actions
Explanation:
The Q-table stores a value Q(s, a) for every state-action pair (s, a).
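A tabular Q-function is just one entry per state-action pair. A sketch with a hypothetical state and action space:

```python
# |S| x |A| entries: 3 states x 2 actions = 6 Q-values, initialized to 0.
states = ["s0", "s1", "s2"]  # hypothetical state space
actions = [0, 1]             # hypothetical action space
Q = {(s, a): 0.0 for s in states for a in actions}
print(len(Q))  # 6
```

A dict keyed by (state, action) is a common alternative to a 2-D array when states are not already integer-indexed.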
25. What is an Episodic Task?
A. A task where the environment changes randomly
B. A task that requires supervised training data
C. A task that continues forever without limit
D. A task with a well-defined starting and ending point (terminal state)
Correct Answer: A task with a well-defined starting and ending point (terminal state)
Explanation:
Episodic tasks break interaction into subsequences called episodes (e.g., a game of Chess), which end in a terminal state.
26. What is a Continuing Task?
A. A task where rewards are always zero
B. A task solvable only by Monte Carlo methods
C. A task that naturally breaks into episodes
D. A task that goes on forever without a terminal state
Correct Answer: A task that goes on forever without a terminal state
Explanation:
Continuing tasks (e.g., a thermostat controlling temperature) do not have a natural end point.
27. The Bellman Expectation Equation for V^π(s) can be written as:
A. V^π(s) = max_a Q^π(s, a)
B. V^π(s) = Σ_a π(a|s) Σ_{s'} P(s'|s, a) [R(s, a, s') + γ V^π(s')]
C. V^π(s) = R(s)
D. V^π(s) = γ V^π(s')
Correct Answer: V^π(s) = Σ_a π(a|s) Σ_{s'} P(s'|s, a) [R(s, a, s') + γ V^π(s')]
Explanation:
This equation averages over the policy's action probabilities π(a|s) and the environment's transition dynamics P(s'|s, a).
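One sweep of this Bellman expectation backup on a tiny, made-up two-state MDP (all names and numbers are illustrative):

```python
# V(s) = sum_a pi(a|s) * sum_s' P(s'|s,a) * [R + gamma * V(s')]
gamma = 0.9
pi = {"s0": {"stay": 0.5, "go": 0.5}, "s1": {"stay": 1.0}}  # pi(a|s)
P = {                                # (s, a) -> [(prob, s_next, reward)]
    ("s0", "stay"): [(1.0, "s0", 0.0)],
    ("s0", "go"):   [(1.0, "s1", 1.0)],
    ("s1", "stay"): [(1.0, "s1", 0.0)],
}
V = {"s0": 0.0, "s1": 0.0}

def backup(s):
    # Average over actions under pi, then over next states under P.
    return sum(
        p_a * sum(p * (r + gamma * V[sn]) for p, sn, r in P[(s, a)])
        for a, p_a in pi[s].items()
    )

V = {s: backup(s) for s in V}
print(V)  # {'s0': 0.5, 's1': 0.0}
```

Repeating such sweeps until V stops changing is exactly iterative policy evaluation, the "prediction problem" of question 36.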
28. What is the TD Error (δ)?
A. The difference between two consecutive rewards
B. The difference between the predicted value and the actual target value
C. The probability of taking a wrong action
D. The error in the reward function
Correct Answer: The difference between the predicted value and the actual target value
Explanation:
In TD(0), δ_t = R_{t+1} + γ V(S_{t+1}) - V(S_t). It measures the surprise, i.e., the difference between the current estimate and the better estimate.
29. Which algorithm is known as "on-policy" TD control?
A. Monte Carlo
B. SARSA
C. Q-Learning
D. Value Iteration
Correct Answer: SARSA
Explanation:
SARSA (State-Action-Reward-State-Action) updates using the action A_{t+1} actually taken in the next state S_{t+1}, meaning it learns the value of the policy it is currently following.
30. The SARSA update rule is given by:
A. Q(s, a) ← Q(s, a) + α [r + γ max_{a'} Q(s', a') - Q(s, a)]
B. Q(s, a) ← Q(s, a) + α [r + γ Q(s', a') - Q(s, a)]
C. Q(s, a) ← r
D. Q(s, a) ← Q(s, a) + α [G_t - Q(s, a)]
Correct Answer: Q(s, a) ← Q(s, a) + α [r + γ Q(s', a') - Q(s, a)]
Explanation:
SARSA uses the actual next action a' selected by the current policy, distinguishing it from Q-Learning, which uses max_{a'} Q(s', a').
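The only difference between the two update rules is the bootstrap target. A side-by-side sketch (the Q-values are hypothetical):

```python
def sarsa_target(Q, r, s_next, a_next, gamma):
    # On-policy: bootstrap from the action a_next actually chosen next.
    return r + gamma * Q[(s_next, a_next)]

def q_learning_target(Q, r, s_next, actions, gamma):
    # Off-policy: bootstrap from the best action, whatever is actually taken.
    return r + gamma * max(Q[(s_next, a)] for a in actions)

Q = {("s2", 0): 0.0, ("s2", 1): 1.0}
t_sarsa = sarsa_target(Q, 1.0, "s2", 0, 0.9)           # 1.0 + 0.9*0.0 = 1.0
t_qlearn = q_learning_target(Q, 1.0, "s2", [0, 1], 0.9)  # 1.0 + 0.9*1.0 = 1.9
```

When the behavior policy happens to pick the greedy action, the two targets coincide; they differ exactly when an exploratory (non-greedy) action is taken next.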
31. If a problem has a continuous state space, which challenge arises for tabular Q-learning?
A. The discount factor must be 1
B. The Curse of Dimensionality (table becomes too large)
C. The rewards cannot be calculated
D. The Markov property no longer holds
Correct Answer: The Curse of Dimensionality (table becomes too large)
Explanation:
With continuous states, the number of state-action pairs is infinite, making a discrete table impossible to store. Function approximation is needed.
32. A Deterministic Policy maps:
A. State to a probability distribution over actions
B. Action to a state
C. State to a reward value
D. State to a single action
Correct Answer: State to a single action
Explanation:
Mathematically, a = π(s). It specifies exactly one action to take in each state.
33. The transition probability P(s' | s, a) represents:
A. The probability of taking action a in state s
B. The probability of receiving a reward in state s'
C. The value of state s'
D. The probability of transitioning to state s' given state s and action a
Correct Answer: The probability of transitioning to state s' given state s and action a
Explanation:
This function describes the dynamics of the environment.
34. Which of the following guarantees the convergence of Q-learning to the optimal Q*?
A. If the policy is strictly greedy
B. If the discount factor is exactly 1
C. If the environment is deterministic only
D. If all state-action pairs are visited infinitely often and the learning rate decays appropriately
Correct Answer: If all state-action pairs are visited infinitely often and the learning rate decays appropriately
Explanation:
Q-learning is proven to converge to the optimal action-value function with probability 1 under these conditions.
35. What is the value of a Terminal State in an episodic task?
A. The last received reward
B. 1
C. 0
D. Infinity
Correct Answer: 0
Explanation:
By definition, there are no future rewards after a terminal state, so its value is 0.
36. What is the Prediction Problem in RL?
A. Predicting the next state
B. Finding the optimal policy
C. Predicting the immediate reward
D. Estimating the value function for a given policy
Correct Answer: Estimating the value function for a given policy
Explanation:
Prediction (or Policy Evaluation) is the task of determining how good a specific policy is. The Control problem is finding the best policy.
37. What is the Control Problem in RL?
A. Controlling the environment parameters
B. Ensuring the agent does not crash
C. Finding the optimal policy that maximizes return
D. Estimating the value of a fixed policy
Correct Answer: Finding the optimal policy that maximizes return
Explanation:
Control involves improving the policy to find the optimal behavior.
38. In the context of the Bellman Equation, what does the term 'Recursive' mean?
A. The function calls itself
B. The function is linear
C. The function is undefined
D. The function depends on the previous time step only
Correct Answer: The function calls itself
Explanation:
The value of the current state is defined in terms of the value of the successor state (V(s')).
39. Which of the following is a model-based algorithm?
A. Q-Learning
B. Monte Carlo
C. SARSA
D. Dynamic Programming
Correct Answer: Dynamic Programming
Explanation:
Dynamic Programming methods assume full knowledge of the MDP (transitions P and rewards R), making them model-based.
40. Why do we use the max operator in Q-Learning?
A. To ensure the agent explores
B. To estimate the value of the best possible future action
C. To minimize the error
D. To calculate the average reward
Correct Answer: To estimate the value of the best possible future action
Explanation:
Q-Learning assumes that from the next state, the optimal action will be taken. This allows it to learn the optimal value Q*.
41. In an MDP, if S is finite, A is finite, and the dynamics are known, which technique can solve for the optimal policy exactly?
A. Random Search
B. Linear Regression
C. Dynamic Programming
D. Clustering
Correct Answer: Dynamic Programming
Explanation:
DP algorithms like Value Iteration and Policy Iteration can exactly solve finite MDPs with known dynamics.
42. What is a Stochastic Policy?
A. A policy that always chooses the same action for a given state
B. A policy that ignores the state
C. A policy used only in deterministic environments
D. A policy where actions are selected based on probabilities
Correct Answer: A policy where actions are selected based on probabilities
Explanation:
A stochastic policy defines π(a|s) = P[A_t = a | S_t = s], allowing for randomness in action selection.
43. In TD Learning, the term R_{t+1} + γ V(S_{t+1}) is known as:
A. The TD Error
B. The TD Target
C. The Return
D. The Baseline
Correct Answer: The TD Target
Explanation:
The update moves the current estimate towards this target value.
44. Which of the following is NOT a component of the RL Agent-Environment interface?
A. Supervised Label
B. Action
C. Reward
D. State
Correct Answer: Supervised Label
Explanation:
RL relies on rewards generated by the environment, not external supervised labels.
45. If an agent always chooses the action with the highest estimated value, it is acting:
A. Greedily
B. Randomly
C. Stochastically
D. Optimally (always guaranteed)
Correct Answer: Greedily
Explanation:
Greedy action selection exploits current knowledge. Note that acting greedily with respect to imperfect knowledge is not necessarily acting optimally.
46. What is the relationship between V*(s) and Q*(s, a)?
A. V*(s) = min_a Q*(s, a)
B. V*(s) = max_a Q*(s, a)
C. V*(s) = Σ_a Q*(s, a)
D. V*(s) = Q*(s, a) for every action a
Correct Answer: V*(s) = max_a Q*(s, a)
Explanation:
The value of a state under the optimal policy is equal to the value of the best action available in that state.
47. Which represents a purely delayed reward scenario?
A. Winning a game of Chess after many moves
B. A thermostat adjusting every minute
C. Getting a point for every correct step
D. Receiving a salary every day
Correct Answer: Winning a game of Chess after many moves
Explanation:
In Chess, rewards (win/loss) are typically only received at the very end of the game, making credit assignment difficult.
48. In the update V(s) ← (1 - α) V(s) + α · Target, what does this represent?
A. The probability of the action
B. A weighted average between the old estimate and the new information
C. A complete replacement of the old value
D. A sum of all past rewards
Correct Answer: A weighted average between the old estimate and the new information
Explanation:
The update rule is a form of exponential moving average, smoothing the estimate over time.
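The weighted-average form is algebraically identical to the usual incremental form old + α(target - old), which is easy to verify numerically (values are illustrative):

```python
def blended_update(old, target, alpha):
    # (1 - alpha)*old + alpha*target: the "weighted average" form.
    return (1 - alpha) * old + alpha * target

old, target, alpha = 2.0, 4.0, 0.25
incremental = old + alpha * (target - old)  # the form used in the TD updates
blended = blended_update(old, target, alpha)
# Both give 2.5: the estimate moves a quarter of the way toward the target.
```

A small α keeps the estimate smooth and slow to change; α = 1 replaces the old value entirely.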
49. What happens if the exploration rate ε in ε-greedy is set to 1?
A. The agent acts completely randomly
B. The agent alternates actions
C. The agent acts purely greedily
D. The agent stops learning
Correct Answer: The agent acts completely randomly
Explanation:
If ε = 1, the agent chooses a random action 100% of the time.
50. Generally, how does TD learning compare to Monte Carlo in terms of variance?
A. They have the same variance
B. TD has lower variance
C. Variance is not a factor in RL
D. TD has higher variance
Correct Answer: TD has lower variance
Explanation:
Because TD updates are based on one step (or a few steps), they are less affected by the randomness of the entire remaining trajectory than MC updates, leading to lower variance (though potentially higher bias).