1. What is the primary goal of an agent in Reinforcement Learning?
A. To minimize the reconstruction error of the input data
B. To maximize the cumulative reward over time
C. To classify data into distinct categories based on labeled examples
D. To find hidden structures in unlabeled data
Correct Answer: To maximize the cumulative reward over time
Explanation:
In Reinforcement Learning, the agent interacts with an environment and attempts to learn a policy that maximizes the total amount of reward it receives over the long run.
2. Which of the following tuple representations correctly defines a Markov Decision Process (MDP)?
A. (S, A)
B. (S, A, P, R, γ)
C. (S, P, R)
D. (S, A, R)
Correct Answer: (S, A, P, R, γ)
Explanation:
An MDP is typically defined by a tuple (S, A, P, R, γ), where S is the state space, A is the action space, P is the transition probability function, R is the reward function, and γ is the discount factor.
3. What does the Markov Property imply about the state of an environment?
A. The current state provides no information about the future
B. The future depends on the entire history of past states
C. The future is independent of the past given the present
D. The transition probabilities change over time
Correct Answer: The future is independent of the past given the present
Explanation:
The Markov Property states that the future state depends only on the current state and action, not on the sequence of events that preceded it. Mathematically: P(S_{t+1} | S_t) = P(S_{t+1} | S_1, S_2, ..., S_t).
4. In the context of RL, what does the discount factor γ (gamma) control?
A. The learning rate of the agent
B. The probability of transitioning to a random state
C. The importance of immediate rewards versus future rewards
D. The exploration rate of the agent
Correct Answer: The importance of immediate rewards versus future rewards
Explanation:
The discount factor γ (where 0 ≤ γ ≤ 1) determines the present value of future rewards. A γ close to 0 makes the agent myopic (caring only about immediate rewards), while a γ close to 1 makes the agent far-sighted.
5. What distinguishes Reinforcement Learning from Supervised Learning?
A. RL is only used for continuous value prediction
B. RL maps inputs to outputs without any feedback
C. RL learns from interaction and delayed feedback (rewards) rather than explicit labels
D. RL relies on a static dataset with labeled targets
Correct Answer: RL learns from interaction and delayed feedback (rewards) rather than explicit labels
Explanation:
Unlike Supervised Learning, which is instructed with the correct answer (labels), RL discovers which actions yield the most reward by trying them (trial and error).
6. What is a Policy (π) in Reinforcement Learning?
A. The mechanism that provides rewards to the agent
B. The numerical value indicating the goodness of a state
C. A function that predicts the next state given the current state
D. A mapping from states to actions (or probabilities of actions)
Correct Answer: A mapping from states to actions (or probabilities of actions)
Explanation:
A policy π defines the agent's behavior. It maps a given state to an action (deterministic) or to a probability distribution over actions (stochastic).
7. Which equation represents the total discounted return G_t?
A. G_t = R_{t+1} + R_{t+2} + R_{t+3} + ...
B. G_t = R_{t+1} + γ R_{t+2} + γ^2 R_{t+3} + ... = Σ_{k=0}^∞ γ^k R_{t+k+1}
C. G_t = γ R_{t+1}
D. G_t = max_k R_{t+k}
Correct Answer: G_t = Σ_{k=0}^∞ γ^k R_{t+k+1}
Explanation:
The return G_t is the sum of discounted future rewards. γ^k discounts the reward received k steps into the future.
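The return above can be computed with a short backward recursion, G_t = R_{t+1} + γ·G_{t+1}. A minimal sketch (the function name and sample rewards are illustrative, not from the quiz):

```python
def discounted_return(rewards, gamma):
    """G = rewards[0] + gamma*rewards[1] + gamma^2*rewards[2] + ..."""
    g = 0.0
    for r in reversed(rewards):  # backward recursion: G_t = R + gamma * G_{t+1}
        g = r + gamma * g
    return g

# With gamma = 0.9: 1 + 0.9 + 0.81 = 2.71
total = discounted_return([1.0, 1.0, 1.0], 0.9)
# With gamma = 0 the agent is myopic: only the first reward counts.
myopic = discounted_return([5.0, 100.0], 0.0)
```

The gamma = 0 case also illustrates the "myopic agent" behavior described in question 4.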
8. What does the State-Value Function V^π(s) represent?
A. The probability of moving to state s
B. The expected return starting from state s and following policy π
C. The maximum reward possible in the entire environment
D. The immediate reward received at state s
Correct Answer: The expected return starting from state s and following policy π
Explanation:
V^π(s) = E_π[G_t | S_t = s]. It tells us how good it is to be in a specific state under a given policy.
9. What is the Action-Value Function Q^π(s, a)?
A. The value of taking action a in state s and then following policy π
B. The probability of taking action a in state s
C. The reward received immediately after taking action a
D. The value of being in state s regardless of the action taken
Correct Answer: The value of taking action a in state s and then following policy π
Explanation:
Q^π(s, a) = E_π[G_t | S_t = s, A_t = a]. It evaluates the expected return of taking a specific action in a specific state and following the policy thereafter.
10. The Bellman Equation expresses a relationship between:
A. The policy and the reward function only
B. The learning rate and the discount factor
C. The value of a state and the value of its successor states
D. The current observation and the previous observation
Correct Answer: The value of a state and the value of its successor states
Explanation:
The Bellman Equation decomposes the value function into two parts: the immediate reward and the discounted value of the successor state(s). It provides a recursive definition.
11. In the Bellman Optimality Equation, which operator is used to define the optimal value?
A. Average
B. Min (Minimization over costs)
C. Max (Maximization over actions)
D. Summation over time
Correct Answer: Max (Maximization over actions)
Explanation:
The optimal value function assumes the agent always selects the action that maximizes the expected return. Thus, it involves max_a.
12. What is the Exploration vs. Exploitation trade-off?
A. Deciding whether to use a neural network or a tabular method
B. Balancing between gathering new information and using known information to maximize reward
C. Choosing between model-based and model-free learning
D. Trading off computation time for memory usage
Correct Answer: Balancing between gathering new information and using known information to maximize reward
Explanation:
Exploration involves trying new actions to discover their rewards, while exploitation involves choosing the action currently known to yield the highest reward.
13. Which method is commonly used to balance exploration and exploitation?
A. ε-greedy (Epsilon-greedy)
B. Gradient Descent
C. Backpropagation
D. Principal Component Analysis
Correct Answer: ε-greedy (Epsilon-greedy)
Explanation:
In ε-greedy, the agent selects a random action with probability ε (exploration) and the best-known action with probability 1 - ε (exploitation).
14. What does it mean for an RL algorithm to be Model-Free?
A. It builds an explicit model of the environment's transition dynamics
B. It does not require knowledge of the transition probability or reward function
C. It cannot solve MDPs
D. It does not use any value functions
Correct Answer: It does not require knowledge of the transition probability or reward function
Explanation:
Model-free algorithms (like Q-Learning and TD Learning) learn directly from experience (samples of state, action, reward) without needing the environment's internal dynamics (model).
15. What is Temporal Difference (TD) Learning?
A. A method that waits until the end of an episode to update values
B. A method that updates estimates based on other learned estimates without waiting for the outcome
C. A supervised learning technique applied to RL
D. A method that requires a complete model of the environment
Correct Answer: A method that updates estimates based on other learned estimates without waiting for the outcome
Explanation:
TD learning combines Monte Carlo ideas (learning from experience) and Dynamic Programming ideas (bootstrapping). It updates the current value estimate towards a target that includes the estimated value of the next state.
16. Which of the following is the TD(0) update rule for the state-value function V(s)?
A. V(S_t) ← V(S_t) + α [G_t - V(S_t)]
B. V(S_t) ← V(S_t) + α [R_{t+1} + γ V(S_{t+1}) - V(S_t)]
C. V(S_t) ← R_{t+1}
D. V(S_t) ← max_a Q(S_t, a)
Correct Answer: V(S_t) ← V(S_t) + α [R_{t+1} + γ V(S_{t+1}) - V(S_t)]
Explanation:
This is the standard TD(0) update, where R_{t+1} + γ V(S_{t+1}) is the TD target and the term in brackets is the TD error.
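The TD(0) rule translates directly into code. A minimal sketch over a dictionary-based value table (state names and numbers are made-up illustrations):

```python
def td0_update(V, s, r, s_next, alpha, gamma):
    """V(s) <- V(s) + alpha * [r + gamma * V(s') - V(s)]; returns the TD error."""
    td_error = r + gamma * V[s_next] - V[s]  # TD target minus current estimate
    V[s] += alpha * td_error
    return td_error

V = {"s1": 0.0, "s2": 1.0}  # hypothetical value table
delta = td0_update(V, "s1", 1.0, "s2", alpha=0.5, gamma=0.9)
# target = 1.0 + 0.9*1.0 = 1.9, so V["s1"] moves halfway there: 0.95
```

Returning the TD error is a common convenience, since the same quantity δ reappears in question 28.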
17. What is Bootstrapping in the context of TD learning?
A. Resampling the dataset to create more training data
B. Updating a value estimate using another estimated value
C. Restarting the episode when the agent gets stuck
D. Initializing weights to zero
Correct Answer: Updating a value estimate using another estimated value
Explanation:
Bootstrapping refers to updating an estimate based on other estimates (e.g., using V(S_{t+1}) to update V(S_t)) rather than waiting for the actual final return.
18. Q-Learning is considered an Off-Policy algorithm. What does this mean?
A. It must follow the exact policy it is trying to learn
B. It requires the environment to be turned off during updates
C. It does not use a policy at all
D. It learns the value of the optimal policy while following a different exploratory policy
Correct Answer: It learns the value of the optimal policy while following a different exploratory policy
Explanation:
Q-Learning approximates Q* (the optimal action-value function) directly, regardless of the policy being followed to generate data (e.g., an ε-greedy policy).
19. Which represents the Q-Learning update equation?
A. Q(s, a) ← Q(s, a) + α [r + γ Q(s', a') - Q(s, a)]
B. Q(s, a) ← Q(s, a) + α [r + γ max_{a'} Q(s', a') - Q(s, a)]
C. Q(s, a) ← r + γ V(s')
D. Q(s, a) ← Q(s, a) + α [G_t - Q(s, a)]
Correct Answer: Q(s, a) ← Q(s, a) + α [r + γ max_{a'} Q(s', a') - Q(s, a)]
Explanation:
Q-Learning uses the maximum estimated value of the next state (max_{a'} Q(s', a')) as the target, aiming directly for optimality.
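The Q-Learning update over a plain dict Q-table can be sketched as follows (state names, actions, and numbers are hypothetical):

```python
def q_learning_update(Q, s, a, r, s_next, actions, alpha, gamma):
    """Q(s,a) <- Q(s,a) + alpha * [r + gamma * max_a' Q(s',a') - Q(s,a)]."""
    best_next = max(Q[(s_next, b)] for b in actions)  # off-policy max over actions
    Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])

Q = {("s0", 0): 0.0, ("s0", 1): 0.0, ("s1", 0): 2.0, ("s1", 1): 0.0}
q_learning_update(Q, "s0", 1, 1.0, "s1", [0, 1], alpha=0.5, gamma=0.9)
# target = 1.0 + 0.9 * 2.0 = 2.8, so Q[("s0", 1)] becomes 0.5 * 2.8 = 1.4
```

Note that the max is taken over the next state's actions regardless of which action the behavior policy actually takes there; that is exactly what makes the algorithm off-policy (question 18).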
20. In the Q-Learning update rule, what is α?
A. Discount factor
B. Learning rate
C. Exploration probability
D. Reward function
Correct Answer: Learning rate
Explanation:
α (alpha) is the learning rate (step size), determining how much newly acquired information overrides old information.
21. If γ = 0, the agent is:
A. Optimal
B. Myopic (short-sighted)
C. Random
D. Infinitely far-sighted
Correct Answer: Myopic (short-sighted)
Explanation:
When the discount factor is 0, future rewards are multiplied by 0. The agent only cares about maximizing the immediate reward R_{t+1}.
22. What is the key difference between Monte Carlo (MC) methods and TD Learning?
A. TD requires a model of the environment
B. MC is biased while TD is unbiased
C. MC can only be used for continuous states
D. MC updates are performed only after a complete episode, while TD updates can happen at every step
Correct Answer: MC updates are performed only after a complete episode, while TD updates can happen at every step
Explanation:
Monte Carlo methods wait for the return to be known (end of episode), whereas TD methods bootstrap and update online.
23. Which of the following best describes the Credit Assignment Problem in RL?
A. Assigning memory to store the Q-table
B. Determining which past action is responsible for a current reward
C. Calculating the computational cost of the algorithm
D. Distributing rewards among multiple agents
Correct Answer: Determining which past action is responsible for a current reward
Explanation:
Because rewards can be delayed, it is difficult to determine exactly which action in a sequence of actions led to a specific positive or negative outcome.
24. In a tabular Q-learning approach, the Q-table has dimensions of:
A. Number of Actions × Number of Rewards
B. Number of States × Number of Actions
C. Number of Episodes × Time Steps
D. Number of States × Number of States
Correct Answer: Number of States × Number of Actions
Explanation:
The Q-table stores a value Q(s, a) for every state-action pair (s, a).
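A tabular Q-function is just one entry per state-action pair. A sketch with a hypothetical state and action space:

```python
# |S| x |A| entries: 3 states x 2 actions = 6 Q-values, initialized to 0.
states = ["s0", "s1", "s2"]  # hypothetical state space
actions = [0, 1]             # hypothetical action space
Q = {(s, a): 0.0 for s in states for a in actions}
print(len(Q))  # 6
```

A dict keyed by (state, action) is a common alternative to a 2-D array when states are not already integer-indexed.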
25. What is an Episodic Task?
A. A task where the environment changes randomly
B. A task that requires supervised training data
C. A task that continues forever without limit
D. A task with a well-defined starting and ending point (terminal state)
Correct Answer: A task with a well-defined starting and ending point (terminal state)
Explanation:
Episodic tasks break interaction into subsequences called episodes (e.g., a game of Chess), which end in a terminal state.
26. What is a Continuing Task?
A. A task where rewards are always zero
B. A task solvable only by Monte Carlo methods
C. A task that naturally breaks into episodes
D. A task that goes on forever without a terminal state
Correct Answer: A task that goes on forever without a terminal state
Explanation:
Continuing tasks (e.g., a thermostat controlling temperature) do not have a natural end point.
27. The Bellman Expectation Equation for V^π(s) can be written as:
A. V^π(s) = max_a Q^π(s, a)
B. V^π(s) = Σ_a π(a|s) Σ_{s'} P(s'|s, a) [R(s, a, s') + γ V^π(s')]
C. V^π(s) = R(s)
D. V^π(s) = γ V^π(s')
Correct Answer: V^π(s) = Σ_a π(a|s) Σ_{s'} P(s'|s, a) [R(s, a, s') + γ V^π(s')]
Explanation:
This equation averages over the policy's action probabilities π(a|s) and the environment's transition dynamics P(s'|s, a).
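One sweep of this Bellman expectation backup on a tiny, made-up two-state MDP (all names and numbers are illustrative):

```python
# V(s) = sum_a pi(a|s) * sum_s' P(s'|s,a) * [R + gamma * V(s')]
gamma = 0.9
pi = {"s0": {"stay": 0.5, "go": 0.5}, "s1": {"stay": 1.0}}  # pi(a|s)
P = {                                # (s, a) -> [(prob, s_next, reward)]
    ("s0", "stay"): [(1.0, "s0", 0.0)],
    ("s0", "go"):   [(1.0, "s1", 1.0)],
    ("s1", "stay"): [(1.0, "s1", 0.0)],
}
V = {"s0": 0.0, "s1": 0.0}

def backup(s):
    # Average over actions under pi, then over next states under P.
    return sum(
        p_a * sum(p * (r + gamma * V[sn]) for p, sn, r in P[(s, a)])
        for a, p_a in pi[s].items()
    )

V = {s: backup(s) for s in V}
print(V)  # {'s0': 0.5, 's1': 0.0}
```

Repeating such sweeps until V stops changing is exactly iterative policy evaluation, the "prediction problem" of question 36.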
28. What is the TD Error (δ)?
A. The difference between two consecutive rewards
B. The difference between the predicted value and the actual target value
C. The probability of taking a wrong action
D. The error in the reward function
Correct Answer: The difference between the predicted value and the actual target value
Explanation:
In TD(0), δ_t = R_{t+1} + γ V(S_{t+1}) - V(S_t). It measures the surprise, i.e., the difference between the current estimate and the better estimate.
29. Which algorithm is known as "on-policy" TD control?
A. Monte Carlo
B. SARSA
C. Q-Learning
D. Value Iteration
Correct Answer: SARSA
Explanation:
SARSA (State-Action-Reward-State-Action) updates using the action A_{t+1} actually taken in the next state S_{t+1}, meaning it learns the value of the policy it is currently following.
30. The SARSA update rule is given by:
A. Q(s, a) ← Q(s, a) + α [r + γ max_{a'} Q(s', a') - Q(s, a)]
B. Q(s, a) ← Q(s, a) + α [r + γ Q(s', a') - Q(s, a)]
C. Q(s, a) ← r
D. Q(s, a) ← Q(s, a) + α [G_t - Q(s, a)]
Correct Answer: Q(s, a) ← Q(s, a) + α [r + γ Q(s', a') - Q(s, a)]
Explanation:
SARSA uses the actual next action a' selected by the current policy, distinguishing it from Q-Learning, which uses max_{a'} Q(s', a').
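The only difference between the two update rules is the bootstrap target. A side-by-side sketch (the Q-values are hypothetical):

```python
def sarsa_target(Q, r, s_next, a_next, gamma):
    # On-policy: bootstrap from the action a_next actually chosen next.
    return r + gamma * Q[(s_next, a_next)]

def q_learning_target(Q, r, s_next, actions, gamma):
    # Off-policy: bootstrap from the best action, whatever is actually taken.
    return r + gamma * max(Q[(s_next, a)] for a in actions)

Q = {("s2", 0): 0.0, ("s2", 1): 1.0}
t_sarsa = sarsa_target(Q, 1.0, "s2", 0, 0.9)           # 1.0 + 0.9*0.0 = 1.0
t_qlearn = q_learning_target(Q, 1.0, "s2", [0, 1], 0.9)  # 1.0 + 0.9*1.0 = 1.9
```

When the behavior policy happens to pick the greedy action, the two targets coincide; they differ exactly when an exploratory (non-greedy) action is taken next.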
31. If a problem has a continuous state space, which challenge arises for tabular Q-learning?
A. The discount factor must be 1
B. The Curse of Dimensionality (table becomes too large)
C. The rewards cannot be calculated
D. The Markov property no longer holds
Correct Answer: The Curse of Dimensionality (table becomes too large)
Explanation:
With continuous states, the number of state-action pairs is infinite, making a discrete table impossible to store. Function approximation is needed.
32. A Deterministic Policy maps:
A. State to a probability distribution over actions
B. Action to a state
C. State to a reward value
D. State to a single action
Correct Answer: State to a single action
Explanation:
Mathematically, a = π(s). It specifies exactly one action to take in each state.
33. The transition probability P(s' | s, a) represents:
A. The probability of taking action a in state s
B. The probability of receiving a reward in state s'
C. The value of state s'
D. The probability of transitioning to state s' given state s and action a
Correct Answer: The probability of transitioning to state s' given state s and action a
Explanation:
This function describes the dynamics of the environment.
34. Which of the following guarantees the convergence of Q-learning to the optimal Q*?
A. If the policy is strictly greedy
B. If the discount factor is exactly 1
C. If the environment is deterministic only
D. If all state-action pairs are visited infinitely often and the learning rate decays appropriately
Correct Answer: If all state-action pairs are visited infinitely often and the learning rate decays appropriately
Explanation:
Q-learning is proven to converge to the optimal action-value function with probability 1 under these conditions.
35. What is the value of a Terminal State in an episodic task?
A. The last received reward
B. 1
C. 0
D. Infinity
Correct Answer: 0
Explanation:
By definition, there are no future rewards after a terminal state, so its value is 0.
36. What is the Prediction Problem in RL?
A. Predicting the next state
B. Finding the optimal policy
C. Predicting the immediate reward
D. Estimating the value function for a given policy
Correct Answer: Estimating the value function for a given policy
Explanation:
Prediction (or Policy Evaluation) is the task of determining how good a specific policy is. The Control problem is finding the best policy.
37. What is the Control Problem in RL?
A. Controlling the environment parameters
B. Ensuring the agent does not crash
C. Finding the optimal policy that maximizes return
D. Estimating the value of a fixed policy
Correct Answer: Finding the optimal policy that maximizes return
Explanation:
Control involves improving the policy to find the optimal behavior.
38. In the context of the Bellman Equation, what does the term 'Recursive' mean?
A. The function calls itself
B. The function is linear
C. The function is undefined
D. The function depends on the previous time step only
Correct Answer: The function calls itself
Explanation:
The value of the current state is defined in terms of the value of the successor state (V(s')).
39. Which of the following is a model-based algorithm?
A. Q-Learning
B. Monte Carlo
C. SARSA
D. Dynamic Programming
Correct Answer: Dynamic Programming
Explanation:
Dynamic Programming methods assume full knowledge of the MDP (transitions P and rewards R), making them model-based.
40. Why do we use the max operator in Q-Learning?
A. To ensure the agent explores
B. To estimate the value of the best possible future action
C. To minimize the error
D. To calculate the average reward
Correct Answer: To estimate the value of the best possible future action
Explanation:
Q-Learning assumes that from the next state, the optimal action will be taken. This allows it to learn the optimal value Q*.
41. In an MDP, if S is finite, A is finite, and the dynamics are known, which technique can solve for the optimal policy exactly?
A. Random Search
B. Linear Regression
C. Dynamic Programming
D. Clustering
Correct Answer: Dynamic Programming
Explanation:
DP algorithms like Value Iteration and Policy Iteration can exactly solve finite MDPs with known dynamics.
42. What is a Stochastic Policy?
A. A policy that always chooses the same action for a given state
B. A policy that ignores the state
C. A policy used only in deterministic environments
D. A policy where actions are selected based on probabilities
Correct Answer: A policy where actions are selected based on probabilities
Explanation:
A stochastic policy defines π(a|s) = P[A_t = a | S_t = s], allowing for randomness in action selection.
43. In TD Learning, the term R_{t+1} + γ V(S_{t+1}) is known as:
A. The TD Error
B. The TD Target
C. The Return
D. The Baseline
Correct Answer: The TD Target
Explanation:
The update moves the current estimate towards this target value.
44. Which of the following is NOT a component of the RL Agent-Environment interface?
A. Supervised Label
B. Action
C. Reward
D. State
Correct Answer: Supervised Label
Explanation:
RL relies on rewards generated by the environment, not external supervised labels.
45. If an agent always chooses the action with the highest estimated value, it is acting:
A. Greedily
B. Randomly
C. Stochastically
D. Optimally (always guaranteed)
Correct Answer: Greedily
Explanation:
Greedy action selection exploits current knowledge. Note that acting greedily with respect to imperfect knowledge is not necessarily acting optimally.
46. What is the relationship between V*(s) and Q*(s, a)?
A. V*(s) = min_a Q*(s, a)
B. V*(s) = max_a Q*(s, a)
C. V*(s) = Σ_a Q*(s, a)
D. V*(s) = Q*(s, a) for every action a
Correct Answer: V*(s) = max_a Q*(s, a)
Explanation:
The value of a state under the optimal policy is equal to the value of the best action available in that state.
47. Which represents a purely delayed reward scenario?
A. Winning a game of Chess after many moves
B. A thermostat adjusting every minute
C. Getting a point for every correct step
D. Receiving a salary every day
Correct Answer: Winning a game of Chess after many moves
Explanation:
In Chess, rewards (win/loss) are typically only received at the very end of the game, making credit assignment difficult.
48. In the update V(s) ← (1 - α) V(s) + α · Target, what does this represent?
A. The probability of the action
B. A weighted average between the old estimate and the new information
C. A complete replacement of the old value
D. A sum of all past rewards
Correct Answer: A weighted average between the old estimate and the new information
Explanation:
The update rule is a form of exponential moving average, smoothing the estimate over time.
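The weighted-average form is algebraically identical to the usual incremental form old + α(target - old), which is easy to verify numerically (values are illustrative):

```python
def blended_update(old, target, alpha):
    # (1 - alpha)*old + alpha*target: the "weighted average" form.
    return (1 - alpha) * old + alpha * target

old, target, alpha = 2.0, 4.0, 0.25
incremental = old + alpha * (target - old)  # the form used in the TD updates
blended = blended_update(old, target, alpha)
# Both give 2.5: the estimate moves a quarter of the way toward the target.
```

A small α keeps the estimate smooth and slow to change; α = 1 replaces the old value entirely.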
49. What happens if the exploration rate ε in ε-greedy is set to 1?
A. The agent acts completely randomly
B. The agent alternates actions
C. The agent acts purely greedily
D. The agent stops learning
Correct Answer: The agent acts completely randomly
Explanation:
If ε = 1, the agent chooses a random action 100% of the time.
50. Generally, how does TD learning compare to Monte Carlo in terms of variance?
A. They have the same variance
B. TD has lower variance
C. Variance is not a factor in RL
D. TD has higher variance
Correct Answer: TD has lower variance
Explanation:
Because TD updates are based on one step (or a few steps), they are less affected by the randomness of the entire remaining trajectory than MC updates, leading to lower variance (though potentially higher bias).