1. What is the primary goal of an agent in Reinforcement Learning?
A. To minimize the error in prediction
B. To cluster similar data points together
C. To maximize the cumulative reward over time
D. To classify data into distinct categories
Correct Answer: To maximize the cumulative reward over time
Explanation:
In Reinforcement Learning, the agent interacts with an environment and aims to learn a policy that maximizes the total amount of reward it receives over the long run.
2. Which of the following elements is NOT a core component of a Reinforcement Learning system?
A. Agent
B. Reward Signal
C. Supervisor Labels
D. Environment
Correct Answer: Supervisor Labels
Explanation:
Reinforcement Learning is distinct from Supervised Learning because it relies on a reward signal from the environment rather than labeled input/output pairs provided by a supervisor.
3. In the context of RL, what does the 'Markov Property' imply about the state?
A. The future depends on the past history of all states.
B. The state transition is always deterministic.
C. The future depends only on the current state and action, not the history.
D. The state is independent of the actions taken.
Correct Answer: The future depends only on the current state and action, not the history.
Explanation:
A state signal has the Markov property if the current state captures all relevant information from the history, making the future independent of the past given the present.
4. What does a 'Policy' represent in Reinforcement Learning?
A. The immediate reward received after an action
B. The calculation of total future reward
C. The probability of moving from one state to another
D. A mapping from perceived states to actions to be taken
Correct Answer: A mapping from perceived states to actions to be taken
Explanation:
A policy defines the agent's behavior. It maps states of the environment to the actions the agent should take when in those states.
5. In an MDP, what does the discount factor (gamma, γ) determine?
A. The magnitude of the transition probability
B. The importance of future rewards relative to immediate rewards
C. The learning rate of the algorithm
D. The probability of choosing a random action
Correct Answer: The importance of future rewards relative to immediate rewards
Explanation:
The discount factor (0 ≤ γ ≤ 1) determines the present value of future rewards. A value close to 0 makes the agent shortsighted, while a value close to 1 makes it farsighted.
6. Which tuple represents a finite Markov Decision Process (MDP)?
A. (S, A, R, γ)
B. (S, A, P, R)
C. (S, A, P, R, γ)
D. (S, P, R, γ)
Correct Answer: (S, A, P, R, γ)
Explanation:
An MDP is typically defined by a tuple containing States (S), Actions (A), Transition Probabilities (P), Reward Function (R), and a Discount Factor (γ).
7. What is the difference between a Value Function V(s) and an Action-Value Function Q(s, a)?
A. There is no mathematical difference.
B. V(s) is for continuous spaces, Q(s, a) is for discrete spaces.
C. V(s) estimates the return of a state, while Q(s, a) estimates the return of taking an action in a state.
D. V(s) includes the action taken, while Q(s, a) does not.
Correct Answer: V(s) estimates the return of a state, while Q(s, a) estimates the return of taking an action in a state.
Explanation:
V(s) (State-Value) predicts the expected return starting from state s. Q(s, a) (Action-Value) predicts the expected return starting from state s, taking action a.
8. The Bellman Equation expresses the relationship between:
A. The agent and the environment
B. The exploration rate and the exploitation rate
C. The current reward and the previous reward
D. The value of a state and the values of its successor states
Correct Answer: The value of a state and the values of its successor states
Explanation:
The Bellman Equation is a recursive definition that decomposes the value function into the immediate reward plus the discounted value of the next state.
9. Which method requires the completion of an entire episode before updating the value estimates?
A. Monte Carlo Learning
B. Temporal Difference Learning
C. Q-Learning
D. Dynamic Programming
Correct Answer: Monte Carlo Learning
Explanation:
Monte Carlo methods wait until the end of an episode to calculate the actual return (sum of rewards) and then update the value estimates based on that return.
10. What is 'Bootstrapping' in the context of Temporal Difference (TD) learning?
A. Running multiple episodes in parallel
B. Updating an estimate based on another estimate
C. Using random weights for initialization
D. Restarting the learning process from scratch
Correct Answer: Updating an estimate based on another estimate
Explanation:
TD learning bootstraps because it updates its guess for the value of a state based on its guess for the value of the next state, rather than waiting for the final actual return.
11. In the Exploration vs. Exploitation trade-off, what does 'Exploitation' refer to?
A. Trying new actions to find better rewards
B. Ignoring the reward signal
C. Choosing the action currently believed to be the best
D. Randomly selecting actions
Correct Answer: Choosing the action currently believed to be the best
Explanation:
Exploitation involves maximizing current expected reward by selecting the action that the agent already knows yields the highest value.
12. What is the Epsilon-Greedy strategy?
A. Always choosing a random action
B. Always choosing the best action
C. Choosing the action with the lowest value
D. Choosing the best action most of the time, but a random action with probability epsilon
Correct Answer: Choosing the best action most of the time, but a random action with probability epsilon
Explanation:
Epsilon-Greedy balances exploration and exploitation. With probability 1-epsilon it exploits (greedy), and with probability epsilon it explores (random action).
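To illustrate the strategy, here is a minimal Python sketch (the function name and the list-of-Q-values representation are just for illustration, not a fixed API):

```python
import random

def epsilon_greedy(q_values, epsilon):
    """With probability epsilon pick a random action (explore);
    otherwise pick the highest-valued action (exploit)."""
    if random.random() < epsilon:
        return random.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])

# epsilon = 0 degenerates to a pure greedy choice (never explores).
print(epsilon_greedy([0.1, 0.9, 0.4], epsilon=0.0))  # → 1
```

With epsilon = 1 every choice is random; typical values like 0.1 explore about 10% of the time.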
13. Which equation represents the Bellman Optimality Equation for V*(s)?
Correct Answer: V*(s) = max_a Σ_{s',r} p(s', r | s, a) [r + γV*(s')]
Explanation:
The Bellman Optimality Equation states that the value of a state under an optimal policy must equal the expected return for the best action from that state.
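To make the "max over actions" concrete, here is a small value-iteration sketch that applies the optimality equation as an update rule. The two-state MDP (its transitions and rewards) is invented purely for illustration:

```python
# Hypothetical two-state, two-action MDP: P[s][a] gives the
# (deterministic) next state and reward for taking action a in state s.
gamma = 0.9
P = {
    0: {0: (0, 0.0), 1: (1, 1.0)},  # from state 0: stay, or move to 1 for reward 1
    1: {0: (1, 0.0), 1: (0, 0.0)},  # from state 1: stay, or move back to 0
}

# Repeatedly apply V(s) <- max_a [r + gamma * V(s')] until it converges.
V = {0: 0.0, 1: 0.0}
for _ in range(200):
    V = {s: max(r + gamma * V[s2] for (s2, r) in P[s].values()) for s in P}

print(round(V[0], 3), round(V[1], 3))
```

At the fixed point, V*(0) = 1 + γV*(1) and V*(1) = γV*(0), i.e. V*(0) = 1/(1 - γ²) ≈ 5.263, consistent with the equation's recursive structure.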
14. In Monte Carlo learning, what is the difference between 'First-visit' and 'Every-visit' MC?
A. First-visit is faster; Every-visit is slower.
B. First-visit updates only the first time a state is visited in an episode; Every-visit updates for all visits.
C. Every-visit is for continuous tasks; First-visit is for episodic tasks.
D. First-visit uses bootstrapping; Every-visit does not.
Correct Answer: First-visit updates only the first time a state is visited in an episode; Every-visit updates for all visits.
Explanation:
In an episode, a state might be visited multiple times. First-visit MC averages returns only from the first occurrence, while Every-visit MC averages returns following all visits to that state.
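A minimal sketch of first-visit MC, assuming each episode is a list of (state, reward) pairs where the reward is the one collected from that state onward (an illustrative convention, not a fixed API):

```python
from collections import defaultdict

def first_visit_mc(episodes, gamma=1.0):
    """Average the return following the FIRST occurrence of each state."""
    returns = defaultdict(list)
    for episode in episodes:                     # episode = [(state, reward), ...]
        g, first_returns = 0.0, {}
        for state, reward in reversed(episode):  # accumulate the return backwards
            g = reward + gamma * g
            first_returns[state] = g             # later overwrites keep the FIRST visit
        for state, g in first_returns.items():
            returns[state].append(g)
    return {s: sum(v) / len(v) for s, v in returns.items()}

# State 'a' is visited twice in one episode; only the first visit's return counts.
print(first_visit_mc([[('a', 0), ('b', 1), ('a', 2)]]))  # → {'a': 3.0, 'b': 3.0}
```

An every-visit variant would instead append g for every occurrence of the state, giving 'a' the average of 3.0 and 2.0.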
15. What is the TD(0) update rule for the state-value estimate V(s)?
Correct Answer: V(s) ← V(s) + α[R + γV(s') − V(s)]
Explanation:
The TD(0) update moves the current value estimate V(s) toward the TD target (R + γV(s')), where alpha is the learning rate.
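As a sketch, the update can be written as follows (the function name and the dictionary value table are illustrative):

```python
def td0_update(V, s, r, s_next, alpha=0.1, gamma=0.9):
    """Nudge V[s] toward the TD target r + gamma * V[s_next]."""
    td_target = r + gamma * V[s_next]   # bootstraps on the estimate V[s_next]
    td_error = td_target - V[s]         # the delta of the TD error equation
    V[s] += alpha * td_error
    return V

V = {'s0': 0.0, 's1': 1.0}
td0_update(V, 's0', r=1.0, s_next='s1')
print(V['s0'])   # moved from 0.0 toward the target 1.9 by a step of alpha = 0.1
```

Note that the update happens after a single step, without waiting for the episode to end.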
16. Which of the following describes a 'Model-Free' RL approach?
A. The agent plans by simulating future states.
B. The agent learns a policy or value function directly from experience without knowing the environment's dynamics.
C. The agent requires a supervisor to model the environment.
D. The agent learns the transition probabilities and reward function explicitly.
Correct Answer: The agent learns a policy or value function directly from experience without knowing the environment's dynamics.
Explanation:
Model-free methods (like MC and TD) do not require a model of the environment (transition probabilities P and rewards R) to learn; they rely on trial-and-error experience.
17. What is the return (G_t) in Reinforcement Learning?
A. The total discounted sum of future rewards
B. The immediate reward received
C. The final reward at the terminal state
D. The average reward of the episode
Correct Answer: The total discounted sum of future rewards
Explanation:
The return G_t is the quantity the agent aims to maximize, calculated as the sum of discounted rewards from time step t until the end of the episode.
18. If the discount factor γ is 0, the agent is:
A. Myopic (cares only about immediate reward)
B. Far-sighted (cares only about long-term reward)
C. Random
D. Optimal
Correct Answer: Myopic (cares only about immediate reward)
Explanation:
If γ = 0, the return G_t reduces to the immediate reward R_{t+1}, meaning the agent ignores all future consequences and focuses solely on the current step.
19. What is an 'Episodic Task'?
A. A task that breaks interaction into subsequences called episodes which end in a terminal state
B. A task with only one state
C. A task that continues indefinitely without end
D. A task where the reward is always zero
Correct Answer: A task that breaks interaction into subsequences called episodes which end in a terminal state
Explanation:
Episodic tasks have a defined starting point and a terminal state (like a game of chess). Once the terminal state is reached, the episode ends and the system resets.
20. Comparing MC and TD methods, which statement is true regarding variance and bias?
A. MC has low variance, high bias.
B. MC has high variance, zero bias; TD has low variance, some bias.
C. TD has high variance, low bias.
D. MC and TD have identical variance and bias properties.
Correct Answer: MC has high variance, zero bias; TD has low variance, some bias.
Explanation:
MC is unbiased because it uses actual returns, but has high variance due to stochasticity over a full episode. TD introduces bias via bootstrapping but has lower variance.
21. In the context of the Bellman Equation, what is p(s', r | s, a)?
A. The discount factor
B. The value function
C. The dynamics function (probability of next state and reward)
D. The policy function
Correct Answer: The dynamics function (probability of next state and reward)
Explanation:
This term represents the joint probability of transitioning to state s' and receiving reward r, given that the agent takes action a in state s.
22. What is a 'Deterministic Policy'?
A. A policy that maps a state to a specific, single action
B. A policy that maps a state to a probability distribution over actions
C. A policy that changes over time
D. A policy that ignores the state
Correct Answer: A policy that maps a state to a specific, single action
Explanation:
A deterministic policy denotes that for every state s, there is exactly one action a = π(s) that the agent will take.
23. Which learning method performs updates step-by-step without waiting for the episode to end?
A. Batch Learning
B. Exhaustive Search
C. Monte Carlo
D. Temporal Difference
Correct Answer: Temporal Difference
Explanation:
Temporal Difference (TD) learning updates estimates after every time step (or n steps) using the immediate reward and the estimate of the next state.
24. The term 'Greedy Action' implies:
A. Selecting the action with the highest estimated value
B. Selecting the action with the lowest cost
C. Selecting an action that maximizes exploration
D. Selecting a random action
Correct Answer: Selecting the action with the highest estimated value
Explanation:
A greedy action is one that exploits current knowledge by choosing the action associated with the maximum Q-value for the current state.
25. What is the role of the Value Function?
A. To define the rules of the environment
B. To predict how good it is to be in a specific state
C. To generate random numbers
D. To store the immediate reward
Correct Answer: To predict how good it is to be in a specific state
Explanation:
Value functions quantify the 'goodness' of a state (or state-action pair), defined as the expected future reward achievable from that state.
26. Which of the following is NOT a challenge in Reinforcement Learning?
A. Delayed Reward
B. Credit Assignment Problem
C. Availability of labeled training data
D. Exploration vs Exploitation
Correct Answer: Availability of labeled training data
Explanation:
Labeled training data is a requirement for Supervised Learning, not RL. RL agents learn from scalar reward signals, not correct answer labels.
27. In the equation G_t = R_{t+1} + γR_{t+2} + γ^2 R_{t+3} + ..., what is G_t?
A. The transition probability
B. The value function
C. The policy
D. The discounted return
Correct Answer: The discounted return
Explanation:
G_t represents the sum of discounted future rewards starting from time t, known as the return.
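A short sketch that computes this sum for a finite list of rewards, also showing how γ controls how myopic the agent is (function name and reward list are illustrative):

```python
def discounted_return(rewards, gamma):
    """G_t = R_{t+1} + gamma*R_{t+2} + gamma^2*R_{t+3} + ..."""
    return sum(gamma ** k * r for k, r in enumerate(rewards))

rewards = [1.0, 1.0, 1.0]
print(discounted_return(rewards, gamma=0.5))  # 1 + 0.5 + 0.25 = 1.75
print(discounted_return(rewards, gamma=0.0))  # myopic: only the first reward, 1.0
```

With γ close to 1 the sum approaches the undiscounted total of 3.0.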
28. Why is exploration necessary in Reinforcement Learning?
A. To avoid overfitting
B. To speed up the calculation of the Bellman equation
C. To discover states and actions that might yield higher rewards than the current best known options
D. To minimize the discount factor
Correct Answer: To discover states and actions that might yield higher rewards than the current best known options
Explanation:
If an agent only exploits, it may get stuck in a suboptimal solution. Exploration ensures the agent gathers enough information about the environment to find the global optimum.
29. Which of the following is an Off-Policy control method?
A. SARSA
B. Monte Carlo Policy Evaluation
C. Standard TD Prediction
D. Q-Learning
Correct Answer: Q-Learning
Explanation:
Q-Learning is off-policy because it learns the value of the optimal policy (using max over actions) independently of the agent's actual actions (which might be epsilon-greedy).
30. What is the 'Credit Assignment Problem' in RL?
A. Assigning monetary value to states
B. Deciding how much memory to allocate
C. Determining which past action is responsible for a current reward
D. Calculating the computational cost of the algorithm
Correct Answer: Determining which past action is responsible for a current reward
Explanation:
Because rewards are often delayed, it is difficult to determine exactly which action in a long sequence caused the positive or negative outcome.
31. An optimal policy π* is defined as:
A. A policy with zero discount factor
B. A policy that explores every state
C. A policy that is better than or equal to all other policies
D. A policy that reaches the terminal state fastest
Correct Answer: A policy that is better than or equal to all other policies
Explanation:
A policy π* is optimal if its expected return V_{π*}(s) is greater than or equal to V_π(s) for all states s and all other policies π.
32. In Q-Learning, the target value for the update is:
A. R + γ Q(s', a')
B. R + γ max_a' Q(s', a')
C. V(s')
D. The actual return G_t
Correct Answer: R + γ max_a' Q(s', a')
Explanation:
Q-Learning updates the Q-value towards the immediate reward plus the discounted value of the best possible action in the next state (greedy target).
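A minimal sketch of the update, with Q stored as a dict of dicts (states, action names, and numbers are illustrative):

```python
def q_learning_update(Q, s, a, r, s_next, alpha=0.5, gamma=0.9):
    """Off-policy: the target maxes over next actions, regardless of
    which action the (possibly epsilon-greedy) behavior policy takes."""
    target = r + gamma * max(Q[s_next].values())  # greedy target
    Q[s][a] += alpha * (target - Q[s][a])
    return Q

Q = {'s0': {'L': 0.0, 'R': 0.0}, 's1': {'L': 2.0, 'R': 4.0}}
q_learning_update(Q, 's0', 'R', r=1.0, s_next='s1')
print(Q['s0']['R'])   # moved halfway toward the greedy target 1 + 0.9 * 4 = 4.6
```

The max over Q['s1'] picks 4.0 here even if the agent's next action is actually 'L'.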
33. Monte Carlo methods are applicable only to:
A. Tasks with known models
B. Deterministic environments
C. Continuous tasks
D. Episodic tasks
Correct Answer: Episodic tasks
Explanation:
Monte Carlo methods require the episode to terminate so that the final return can be calculated and used for updates.
34. What does SARSA stand for?
A. System-Action-Reward-System-Action
B. State-Action-Return-State-Average
C. State-Action-Reward-State-Action
D. Search-And-Retrieve-Sorted-Arrays
Correct Answer: State-Action-Reward-State-Action
Explanation:
SARSA represents the quintuple (S_t, A_t, R_{t+1}, S_{t+1}, A_{t+1}) used in the update rule for this on-policy TD control method.
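A minimal sketch of the SARSA update; note that the target uses a_next, the action actually taken, rather than a max (names and numbers are illustrative):

```python
def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.5, gamma=0.9):
    """On-policy: the target uses the action a_next the policy actually
    takes next, exploratory or not. This is what makes SARSA on-policy."""
    target = r + gamma * Q[s_next][a_next]
    Q[s][a] += alpha * (target - Q[s][a])
    return Q

Q = {'s0': {'L': 0.0, 'R': 0.0}, 's1': {'L': 2.0, 'R': 4.0}}
sarsa_update(Q, 's0', 'R', r=1.0, s_next='s1', a_next='L')
print(Q['s0']['R'])   # target is 1 + 0.9 * 2 = 2.8, not the greedy 4.6
```

If the exploratory next action 'L' is taken, SARSA's target is lower than Q-Learning's, which always uses the max.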
35. When does the 'Optimistic Initial Values' technique encourage exploration?
A. When initial value estimates are set very low
B. When initial value estimates are set higher than the expected maximum reward
C. When the discount factor is 1
D. When epsilon is set to 0
Correct Answer: When initial value estimates are set higher than the expected maximum reward
Explanation:
If values start high, the agent is 'disappointed' by actual rewards (which are lower) and tries other actions to find the supposed high rewards, thus exploring.
36. Which Bellman equation is linear?
A. Bellman Expectation Equation
B. Bellman Optimality Equation
C. Both
D. Neither
Correct Answer: Bellman Expectation Equation
Explanation:
The Bellman Expectation Equation for a fixed policy is a system of linear equations. The Optimality Equation contains a 'max' operator, making it non-linear.
37. What is the main advantage of TD learning over Monte Carlo?
A. It works better for non-Markov environments
B. It can learn online during an episode
C. It is unbiased
D. It requires less memory
Correct Answer: It can learn online during an episode
Explanation:
TD learning can update values after every step, making it suitable for continuous tasks or very long episodes, whereas MC must wait for the episode to end.
38. The sequence of states and actions S_0, A_0, R_1, S_1, A_1, R_2, ... is called:
A. A Value Function
B. A Policy
C. A Model
D. A Trajectory
Correct Answer: A Trajectory
Explanation:
A trajectory (or history) is the sequence of states, actions, and rewards encountered by the agent as it interacts with the environment.
39. In a stochastic environment:
A. Taking an action leads to a next state based on a probability distribution
B. Taking an action always leads to the same next state
C. The agent cannot learn
D. Rewards are not provided
Correct Answer: Taking an action leads to a next state based on a probability distribution
Explanation:
Stochasticity means there is randomness in the transitions; doing action A in state S does not guarantee arriving at state S'.
40. Which algorithm is considered 'On-Policy'?
A. Q-Learning
B. Max-Q
C. Off-Policy MC
D. SARSA
Correct Answer: SARSA
Explanation:
SARSA is on-policy because it updates the Q-values based on the action actually taken by the current policy (including exploratory steps).
41. The quantity R + γV(s') is often called the:
A. Monte Carlo Return
B. TD Target
C. TD Error
D. Exploration Bonus
Correct Answer: TD Target
Explanation:
In TD learning, the estimate is updated towards this value, which acts as the target for the prediction.
42. If an agent uses a pure Greedy strategy (epsilon = 0), it:
A. Alternates between exploration and exploitation
B. Explores 50% of the time
C. Explores randomly
D. Never explores
Correct Answer: Never explores
Explanation:
A pure greedy strategy always picks the current best-known action, never trying new actions to see if they are better (zero exploration).
43. The State-Value function V_π(s) is the expected return starting from state s and then following:
A. The greedy policy
B. A random policy
C. Policy π
D. The optimal policy
Correct Answer: Policy π
Explanation:
V_π(s) specifically evaluates the expected return if the agent behaves according to the specific policy π.
44. Dynamic Programming (DP) methods in RL assume:
A. Monte Carlo sampling is used
B. The environment is unknown
C. A perfect model of the environment is available
D. Rewards are always positive
Correct Answer: A perfect model of the environment is available
Explanation:
DP algorithms (like Policy Iteration and Value Iteration) require knowledge of the transition probabilities and reward functions (the model) to compute values.
45. What is 'Policy Improvement'?
A. Calculating the value function for a policy
B. Making a new policy that is greedy with respect to the current value function
C. Increasing the learning rate
D. Collecting more data
Correct Answer: Making a new policy that is greedy with respect to the current value function
Explanation:
Policy Improvement generates a better policy π' by acting greedily with respect to the value function of the current policy π.
46. Upper Confidence Bound (UCB) is an algorithm used to handle:
A. Discount Factors
B. Continuous State Spaces
C. The Bellman Equation
D. The Exploration-Exploitation Dilemma
Correct Answer: The Exploration-Exploitation Dilemma
Explanation:
UCB selects actions based on their estimated value plus a confidence interval term, encouraging exploration of actions with uncertain values.
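A minimal sketch of UCB action selection, where c is the exploration constant, counts holds per-action visit counts, and t is the time step (all names are illustrative):

```python
import math

def ucb_select(values, counts, t, c=2.0):
    """Choose the action maximizing estimated value + uncertainty bonus."""
    def score(a):
        if counts[a] == 0:
            return math.inf  # untried actions are chosen first
        return values[a] + c * math.sqrt(math.log(t) / counts[a])
    return max(range(len(values)), key=score)

# Action 1 has the lower estimate but far fewer tries, so its bonus wins here.
print(ucb_select(values=[1.0, 0.8], counts=[100, 2], t=102))  # → 1
```

As counts[a] grows the bonus shrinks, so heavily sampled actions are eventually judged on their estimates alone.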
47. In the TD error equation δ = R + γV(s') − V(s), what does δ represent?
A. The total return
B. The difference between the target and the current estimate
C. The learning rate
D. The probability of the next state
Correct Answer: The difference between the target and the current estimate
Explanation:
The TD error (delta) measures the surprise or difference between the improved estimate (target) and the current estimate.
48. A key distinction between Reinforcement Learning and Unsupervised Learning is:
A. RL maximizes a reward signal, Unsupervised Learning finds hidden structure