1. What is the primary goal of an agent in Reinforcement Learning?
A. To classify data into distinct categories based on labeled examples
B. To maximize the cumulative reward over time
C. To minimize the reconstruction error of the input data
D. To find hidden structures in unlabeled data
Correct Answer: To maximize the cumulative reward over time
Explanation: In Reinforcement Learning, the agent interacts with an environment and attempts to learn a policy that maximizes the total amount of reward it receives over the long run.
2. Which of the following tuple representations correctly defines a Markov Decision Process (MDP)?
A. $(S, A)$
B. $(S, A, P, R, \gamma)$
C. $(S, P, R)$
D. $(S, A, R)$
Correct Answer: $(S, A, P, R, \gamma)$
Explanation: An MDP is typically defined by a tuple $(S, A, P, R, \gamma)$, where $S$ is the state space, $A$ is the action space, $P$ is the transition probability, $R$ is the reward function, and $\gamma$ is the discount factor.
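As a rough illustration of that tuple (not part of the original question set), a tiny MDP can be written down directly as Python containers; the names `states`, `actions`, `P`, `R`, and `gamma` below are our own illustrative choices:

```python
# A toy 2-state, 2-action MDP stored as plain Python containers.
# P[s][a] is a list of (probability, next_state) pairs; R[s][a] is the expected reward.
states = [0, 1]
actions = [0, 1]
P = {
    0: {0: [(0.9, 0), (0.1, 1)], 1: [(0.2, 0), (0.8, 1)]},
    1: {0: [(1.0, 1)],           1: [(0.5, 0), (0.5, 1)]},
}
R = {
    0: {0: 0.0, 1: 1.0},
    1: {0: 0.0, 1: 2.0},
}
gamma = 0.95  # discount factor
```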
3. What does the Markov Property imply about the state of an environment?
A. The future is independent of the past given the present
B. The future depends on the entire history of past states
C. The current state provides no information about the future
D. The transition probabilities change over time
Correct Answer: The future is independent of the past given the present
Explanation: The Markov Property states that the future state depends only on the current state and action, not on the sequence of events that preceded it. Mathematically: $P(S_{t+1} \mid S_t, A_t) = P(S_{t+1} \mid S_1, A_1, \ldots, S_t, A_t)$.
4. In the context of RL, what does the discount factor $\gamma$ (gamma) control?
A. The learning rate of the agent
B. The exploration rate of the agent
C. The importance of immediate rewards versus future rewards
D. The probability of transitioning to a random state
Correct Answer: The importance of immediate rewards versus future rewards
Explanation: The discount factor $\gamma$ (where $0 \le \gamma \le 1$) determines the present value of future rewards. A $\gamma$ close to 0 makes the agent myopic (caring only about immediate rewards), while a $\gamma$ close to 1 makes the agent far-sighted.
5. What distinguishes Reinforcement Learning from Supervised Learning?
A. RL relies on a static dataset with labeled targets
B. RL maps inputs to outputs without any feedback
C. RL learns from interaction and delayed feedback (rewards) rather than explicit labels
D. RL is only used for continuous value prediction
Correct Answer: RL learns from interaction and delayed feedback (rewards) rather than explicit labels
Explanation: Unlike Supervised Learning, where the learner is instructed with the correct answer (labels), RL discovers which actions yield the most reward by trying them (trial and error).
6. What is a Policy ($\pi$) in Reinforcement Learning?
A. A function that predicts the next state given the current state
B. A mapping from states to actions (or probabilities of actions)
C. The numerical value indicating the goodness of a state
D. The mechanism that provides rewards to the agent
Correct Answer: A mapping from states to actions (or probabilities of actions)
Explanation: A policy $\pi$ defines the agent's behavior. It maps a given state to an action (deterministic) or a probability distribution over actions (stochastic).
7. Which equation represents the total discounted return $G_t$?
A. $G_t = R_{t+1} + R_{t+2} + R_{t+3} + \ldots$
B. $G_t = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1}$
C. $G_t = R_{t+1}$
D. $G_t = \gamma R_{t+1}$
Correct Answer: $G_t = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1}$
Explanation: The return $G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \ldots$ is the sum of discounted future rewards. The factor $\gamma^k$ discounts the reward received $k$ steps into the future.
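For concreteness, here is a minimal sketch of computing that return from a recorded list of rewards; the helper name `discounted_return` is an illustrative assumption, not something defined in the quiz:

```python
def discounted_return(rewards, gamma=0.99):
    """Compute G_0 = r_1 + gamma*r_2 + gamma^2*r_3 + ... for one episode."""
    g = 0.0
    for r in reversed(rewards):  # work backwards: G_t = r_{t+1} + gamma * G_{t+1}
        g = r + gamma * g
    return g

print(discounted_return([1.0, 0.0, 2.0], gamma=0.9))  # 1 + 0.9*0 + 0.81*2 = 2.62
```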
8. What does the State-Value Function $V^\pi(s)$ represent?
A. The immediate reward received at state $s$
B. The probability of moving to state $s$
C. The expected return starting from state $s$ and following policy $\pi$
D. The maximum reward possible in the entire environment
Correct Answer: The expected return starting from state $s$ and following policy $\pi$
Explanation: $V^\pi(s) = \mathbb{E}_\pi[G_t \mid S_t = s]$. It tells us how good it is to be in a specific state under a given policy.
9. What is the Action-Value Function $Q^\pi(s, a)$?
A. The value of taking action $a$ in state $s$ and then following policy $\pi$
B. The value of being in state $s$ regardless of the action taken
C. The probability of taking action $a$ in state $s$
D. The reward received immediately after taking action $a$
Correct Answer: The value of taking action $a$ in state $s$ and then following policy $\pi$
Explanation: $Q^\pi(s, a) = \mathbb{E}_\pi[G_t \mid S_t = s, A_t = a]$. It evaluates the expected return of taking a specific action in a specific state and following the policy thereafter.
10. The Bellman Equation expresses a relationship between:
A. The value of a state and the value of its successor states
B. The policy and the reward function only
C. The learning rate and the discount factor
D. The current observation and the previous observation
Correct Answer: The value of a state and the value of its successor states
Explanation: The Bellman Equation decomposes the value function into two parts: the immediate reward and the discounted value of the successor state(s). It provides a recursive definition.
11. In the Bellman Optimality Equation, which operator is used to define the optimal value?
A. Average
B. Summation over time
C. Max (Maximization over actions)
D. Min (Minimization over costs)
Correct Answer: Max (Maximization over actions)
Explanation: The optimal value function assumes the agent always selects the action that maximizes the expected return. Thus, it involves $\max_a$.
12. What is the Exploration vs. Exploitation trade-off?
A. Choosing between model-based and model-free learning
B. Balancing between gathering new information and using known information to maximize reward
C. Deciding whether to use a neural network or a tabular method
D. Trading off computation time for memory usage
Correct Answer: Balancing between gathering new information and using known information to maximize reward
Explanation: Exploration involves trying new actions to discover their rewards, while exploitation involves choosing the action currently known to yield the highest reward.
13. Which method is commonly used to balance exploration and exploitation?
A. Gradient Descent
B. $\epsilon$-greedy (Epsilon-greedy)
C. Backpropagation
D. Principal Component Analysis
Correct Answer: $\epsilon$-greedy (Epsilon-greedy)
Explanation: In $\epsilon$-greedy, the agent selects a random action with probability $\epsilon$ (exploration) and the best-known action with probability $1 - \epsilon$ (exploitation).
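A minimal sketch of $\epsilon$-greedy selection over a tabular Q-function might look as follows (the `epsilon_greedy` helper and the NumPy Q-table layout are illustrative assumptions):

```python
import numpy as np

def epsilon_greedy(Q, state, epsilon=0.1):
    """Pick a random action with probability epsilon, otherwise the greedy action."""
    n_actions = Q.shape[1]
    if np.random.rand() < epsilon:
        return int(np.random.randint(n_actions))  # explore: uniform random action
    return int(np.argmax(Q[state]))               # exploit: best-known (greedy) action

Q = np.zeros((5, 3))                              # toy Q-table: 5 states, 3 actions
action = epsilon_greedy(Q, state=0, epsilon=0.2)
```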
14. What does it mean for an RL algorithm to be Model-Free?
A. It builds an explicit model of the environment's transition dynamics
B. It does not require knowledge of the transition probability or reward function
C. It does not use any value functions
D. It cannot solve MDPs
Correct Answer: It does not require knowledge of the transition probability or reward function
Explanation: Model-free algorithms (like Q-Learning and TD Learning) learn directly from experience (samples of state, action, reward) without needing the environment's internal dynamics (model).
15. What is Temporal Difference (TD) Learning?
A. A method that waits until the end of an episode to update values
B. A method that updates estimates based on other learned estimates without waiting for the outcome
C. A supervised learning technique applied to RL
D. A method that requires a complete model of the environment
Correct Answer: A method that updates estimates based on other learned estimates without waiting for the outcome
Explanation: TD learning combines Monte Carlo ideas (learning from experience) and Dynamic Programming ideas (bootstrapping). It updates the current value estimate towards a target that includes the estimated value of the next state.
16. Which of the following is the TD(0) update rule for the state-value function $V(s)$?
A. $V(S_t) \leftarrow V(S_t) + \alpha [G_t - V(S_t)]$
B. $V(S_t) \leftarrow V(S_t) + \alpha [R_{t+1} + \gamma V(S_{t+1}) - V(S_t)]$
C. $V(S_t) \leftarrow R_{t+1} + \gamma V(S_{t+1})$
D. $V(S_t) \leftarrow V(S_t) + \alpha [R_{t+1} + \gamma \max_a Q(S_{t+1}, a) - V(S_t)]$
Correct Answer: $V(S_t) \leftarrow V(S_t) + \alpha [R_{t+1} + \gamma V(S_{t+1}) - V(S_t)]$
Explanation: This is the standard TD(0) update, where $R_{t+1} + \gamma V(S_{t+1})$ is the TD target and the term in brackets is the TD error.
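A sketch of this update in code, assuming a simple NumPy value table indexed by state (the names below are illustrative):

```python
import numpy as np

V = np.zeros(6)                 # toy value table for 6 states
alpha, gamma = 0.1, 0.99

def td0_update(V, s, r, s_next, alpha, gamma):
    """Apply one TD(0) update: V(s) <- V(s) + alpha * [r + gamma*V(s') - V(s)]."""
    td_target = r + gamma * V[s_next]
    td_error = td_target - V[s]
    V[s] += alpha * td_error
    return td_error

delta = td0_update(V, s=0, r=1.0, s_next=1, alpha=alpha, gamma=gamma)
```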
17. What is Bootstrapping in the context of TD learning?
A. Resampling the dataset to create more training data
B. Updating a value estimate using another estimated value
C. Initializing weights to zero
D. Restarting the episode when the agent gets stuck
Correct Answer: Updating a value estimate using another estimated value
Explanation: Bootstrapping refers to updating an estimate based on other estimates (e.g., using $V(S_{t+1})$ to update $V(S_t)$) rather than waiting for the actual final return.
18. Q-Learning is considered an Off-Policy algorithm. What does this mean?
A. It learns the value of the optimal policy while following a different exploratory policy
B. It must follow the exact policy it is trying to learn
C. It does not use a policy at all
D. It requires the environment to be turned off during updates
Correct Answer: It learns the value of the optimal policy while following a different exploratory policy
Explanation: Q-Learning approximates $Q^*$ (the optimal action-value function) directly, regardless of the policy being followed to generate data (e.g., an $\epsilon$-greedy policy).
19. Which represents the Q-Learning update equation?
A. $Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha [R_{t+1} + \gamma Q(S_{t+1}, A_{t+1}) - Q(S_t, A_t)]$
B. $Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha [R_{t+1} + \gamma \max_a Q(S_{t+1}, a) - Q(S_t, A_t)]$
C. $Q(S_t, A_t) \leftarrow R_{t+1} + \gamma V(S_{t+1})$
D. $Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha [R_{t+1} - Q(S_t, A_t)]$
Correct Answer: $Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha [R_{t+1} + \gamma \max_a Q(S_{t+1}, a) - Q(S_t, A_t)]$
Explanation: Q-Learning uses the maximum estimated value of the next state ($\max_a Q(S_{t+1}, a)$) as the target, aiming directly for optimality.
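A tabular sketch of this update with an illustrative NumPy Q-table (the helper name `q_learning_update` is our own):

```python
import numpy as np

Q = np.zeros((5, 3))            # toy Q-table: 5 states, 3 actions
alpha, gamma = 0.1, 0.99

def q_learning_update(Q, s, a, r, s_next, alpha, gamma):
    """Off-policy target: bootstrap from the best action in the next state."""
    td_target = r + gamma * np.max(Q[s_next])
    Q[s, a] += alpha * (td_target - Q[s, a])

q_learning_update(Q, s=0, a=1, r=1.0, s_next=2, alpha=alpha, gamma=gamma)
```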
20. In the Q-Learning update rule, what is $\alpha$?
A. Discount factor
B. Learning rate
C. Exploration probability
D. Reward function
Correct Answer: Learning rate
Explanation: $\alpha$ (alpha) is the learning rate (step size), determining how much newly acquired information overrides old information.
21. If $\gamma = 0$, the agent is:
A. Infinitely far-sighted
B. Myopic (short-sighted)
C. Random
D. Optimal
Correct Answer: Myopic (short-sighted)
Explanation: When the discount factor is 0, future rewards are multiplied by 0. The agent only cares about maximizing the immediate reward $R_{t+1}$.
22. What is the key difference between Monte Carlo (MC) methods and TD Learning?
A. MC can only be used for continuous states
B. TD requires a model of the environment
C. MC updates are performed only after a complete episode, while TD updates can happen at every step
D. MC is biased while TD is unbiased
Correct Answer: MC updates are performed only after a complete episode, while TD updates can happen at every step
Explanation: Monte Carlo methods wait for the return to be known (end of episode), whereas TD methods bootstrap and update online.
23. Which of the following best describes the Credit Assignment Problem in RL?
A. Determining which past action is responsible for a current reward
B. Assigning memory to store the Q-table
C. Calculating the computational cost of the algorithm
D. Distributing rewards among multiple agents
Correct Answer: Determining which past action is responsible for a current reward
Explanation: Because rewards can be delayed, it is difficult to determine exactly which action in a sequence of actions led to a specific positive or negative outcome.
24. In a tabular Q-learning approach, the Q-table has dimensions of:
A. Number of States × Number of States
B. Number of States × Number of Actions
C. Number of Actions × Number of Rewards
D. Number of Episodes × Time Steps
Correct Answer: Number of States × Number of Actions
Explanation: The Q-table stores a value for every state-action pair $(s, a)$.
25. What is an Episodic Task?
A. A task that continues forever without limit
B. A task with a well-defined starting and ending point (terminal state)
C. A task where the environment changes randomly
D. A task that requires supervised training data
Correct Answer: A task with a well-defined starting and ending point (terminal state)
Explanation: Episodic tasks break interaction into subsequences called episodes (e.g., a game of Chess), which end in a terminal state.
26. What is a Continuing Task?
A. A task that naturally breaks into episodes
B. A task that goes on forever without a terminal state
C. A task where rewards are always zero
D. A task solvable only by Monte Carlo methods
Correct Answer: A task that goes on forever without a terminal state
Explanation: Continuing tasks (e.g., a thermostat controlling temperature) do not have a natural end point.
27. The Bellman Expectation Equation for $V^\pi(s)$ can be written as:
A. $V^\pi(s) = \max_a \sum_{s'} P(s' \mid s, a) [R(s, a, s') + \gamma V^\pi(s')]$
B. $V^\pi(s) = \sum_a \pi(a \mid s) \sum_{s'} P(s' \mid s, a) [R(s, a, s') + \gamma V^\pi(s')]$
C. $V^\pi(s) = R(s)$
D. $V^\pi(s) = \sum_a \pi(a \mid s) R(s, a)$
Correct Answer: $V^\pi(s) = \sum_a \pi(a \mid s) \sum_{s'} P(s' \mid s, a) [R(s, a, s') + \gamma V^\pi(s')]$
Explanation: This equation averages over the policy's action probabilities $\pi(a \mid s)$ and the environment's transition dynamics $P(s' \mid s, a)$.
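As an illustrative sketch, iterative policy evaluation applies exactly this equation as a repeated sweep; the toy 2-state MDP and uniform random policy below are invented for the example:

```python
import numpy as np

# Iterative policy evaluation on a toy 2-state MDP (illustrative numbers only).
# P[s][a] = list of (probability, next_state); R[s][a] = expected immediate reward.
P = {0: {0: [(1.0, 0)], 1: [(1.0, 1)]},
     1: {0: [(1.0, 0)], 1: [(1.0, 1)]}}
R = {0: {0: 0.0, 1: 1.0},
     1: {0: 0.0, 1: 2.0}}
pi = {0: {0: 0.5, 1: 0.5}, 1: {0: 0.5, 1: 0.5}}    # uniform random policy
gamma, V = 0.9, np.zeros(2)

for _ in range(200):                                # repeated sweeps converge to V^pi
    for s in P:
        V[s] = sum(pi[s][a] * (R[s][a] + gamma * sum(p * V[s2] for p, s2 in P[s][a]))
                   for a in P[s])
print(V)   # approximate V^pi for the uniform random policy
```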
28. What is the TD Error ($\delta$)?
A. The difference between the predicted value and the actual target value
B. The error in the reward function
C. The difference between two consecutive rewards
D. The probability of taking a wrong action
Correct Answer: The difference between the predicted value and the actual target value
Explanation: In TD(0), $\delta_t = R_{t+1} + \gamma V(S_{t+1}) - V(S_t)$. It measures the surprise, or the difference between the current estimate and the better estimate.
29. Which algorithm is known as "on-policy" TD control?
A. Q-Learning
B. SARSA
C. Value Iteration
D. Monte Carlo
Correct Answer: SARSA
Explanation: SARSA (State-Action-Reward-State-Action) updates using the action $A_{t+1}$ actually taken in the next state $S_{t+1}$, meaning it learns the value of the policy it is currently following.
30. The SARSA update rule is given by:
A. $Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha [R_{t+1} + \gamma \max_a Q(S_{t+1}, a) - Q(S_t, A_t)]$
B. $Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha [R_{t+1} + \gamma Q(S_{t+1}, A_{t+1}) - Q(S_t, A_t)]$
C. $Q(S_t, A_t) \leftarrow R_{t+1} + \gamma Q(S_{t+1}, A_{t+1})$
D. $Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha [G_t - Q(S_t, A_t)]$
Correct Answer: $Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha [R_{t+1} + \gamma Q(S_{t+1}, A_{t+1}) - Q(S_t, A_t)]$
Explanation: SARSA uses the actual next action $A_{t+1}$ selected by the current policy, distinguishing it from Q-Learning which uses $\max_a Q(S_{t+1}, a)$.
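A tabular sketch of the SARSA update, mirroring the Q-Learning sketch above but bootstrapping from the action actually taken (names are illustrative):

```python
import numpy as np

Q = np.zeros((5, 3))            # toy Q-table: 5 states, 3 actions
alpha, gamma = 0.1, 0.99

def sarsa_update(Q, s, a, r, s_next, a_next, alpha, gamma):
    """On-policy target: bootstrap from the action actually chosen in the next state."""
    td_target = r + gamma * Q[s_next, a_next]
    Q[s, a] += alpha * (td_target - Q[s, a])

sarsa_update(Q, s=0, a=1, r=1.0, s_next=2, a_next=0, alpha=alpha, gamma=gamma)
```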
31. If a problem has a continuous state space, which challenge arises for tabular Q-learning?
A. The rewards cannot be calculated
B. The discount factor must be 1
C. The Curse of Dimensionality (table becomes too large)
D. The Markov property no longer holds
Correct Answer: The Curse of Dimensionality (table becomes too large)
Explanation: With continuous states, the number of state-action pairs is infinite, making a discrete table impossible to store. Function approximation is needed.
32. A Deterministic Policy maps:
A. State to a probability distribution over actions
B. State to a single action
C. State to a reward value
D. Action to a state
Correct Answer: State to a single action
Explanation: Mathematically, $a = \pi(s)$. It specifies exactly one action to take in each state.
33. The transition probability $P(s' \mid s, a)$ represents:
A. The probability of receiving a reward in state $s$
B. The probability of transitioning to state $s'$ given state $s$ and action $a$
C. The probability of taking action $a$ in state $s$
D. The value of state $s$
Correct Answer: The probability of transitioning to state $s'$ given state $s$ and action $a$
Explanation: This describes the dynamics of the environment.
34. Which of the following guarantees the convergence of Q-learning to the optimal $Q^*$?
A. If the environment is deterministic only
B. If all state-action pairs are visited infinitely often and the learning rate decays appropriately
C. If the discount factor is exactly 1
D. If the policy is strictly greedy
Correct Answer: If all state-action pairs are visited infinitely often and the learning rate decays appropriately
Explanation: Q-learning is proven to converge to the optimal action-value function with probability 1 under these conditions.
35. What is the value of a Terminal State in an episodic task?
A. 1
B. Infinity
C. 0
D. The last received reward
Correct Answer: 0
Explanation: By definition, there are no future rewards after a terminal state, so its value is 0.
36. What is the Prediction Problem in RL?
A. Finding the optimal policy
B. Estimating the value function for a given policy
C. Predicting the next state
D. Predicting the immediate reward
Correct Answer: Estimating the value function for a given policy
Explanation: Prediction (or Policy Evaluation) is the task of determining how good a specific policy is. The Control problem is finding the best policy.
37. What is the Control Problem in RL?
A. Controlling the environment parameters
B. Estimating the value of a fixed policy
C. Finding the optimal policy that maximizes return
D. Ensuring the agent does not crash
Correct Answer: Finding the optimal policy that maximizes return
Explanation: Control involves improving the policy to find the optimal behavior.
38. In the context of the Bellman Equation, what does the term 'Recursive' mean?
A. The function calls itself
B. The function is undefined
C. The function depends on the previous time step only
D. The function is linear
Correct Answer: The function calls itself
Explanation: The value of the current state is defined in terms of the value of the successor state ($V(s)$ is expressed using $V(s')$).
39. Which of the following is a model-based algorithm?
A. Q-Learning
B. SARSA
C. Monte Carlo Control
D. Dynamic Programming (e.g., Value Iteration)
Correct Answer: Dynamic Programming (e.g., Value Iteration)
Explanation: Dynamic Programming methods assume full knowledge of the MDP (transitions $P$ and rewards $R$), making them model-based.
40. Why do we use the max operator in Q-Learning?
A. To calculate the average reward
B. To estimate the value of the best possible future action
C. To ensure the agent explores
D. To minimize the error
Correct Answer: To estimate the value of the best possible future action
Explanation: Q-Learning assumes that from the next state, the optimal action will be taken. This allows it to learn the optimal value $Q^*$.
41. In an MDP, if $S$ is finite, $A$ is finite, and the dynamics are known, which technique can solve for the optimal policy exactly?
A. Dynamic Programming
B. Random Search
C. Linear Regression
D. Clustering
Correct Answer: Dynamic Programming
Explanation: DP algorithms like Value Iteration and Policy Iteration can exactly solve finite MDPs with known dynamics.
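A compact Value Iteration sketch on an invented toy MDP (the container layout mirrors the earlier examples and is purely illustrative):

```python
import numpy as np

# Value Iteration on a toy 2-state MDP (illustrative numbers only).
P = {0: {0: [(1.0, 0)], 1: [(1.0, 1)]},
     1: {0: [(1.0, 0)], 1: [(1.0, 1)]}}
R = {0: {0: 0.0, 1: 1.0},
     1: {0: 0.0, 1: 2.0}}
gamma, V = 0.9, np.zeros(2)

for _ in range(200):
    for s in P:   # Bellman optimality backup: V(s) = max_a [R(s,a) + gamma * E[V(s')]]
        V[s] = max(R[s][a] + gamma * sum(p * V[s2] for p, s2 in P[s][a]) for a in P[s])

# Greedy policy extracted from the (approximately) converged values.
policy = {s: max(P[s], key=lambda a: R[s][a] + gamma * sum(p * V[s2] for p, s2 in P[s][a]))
          for s in P}
print(V, policy)
```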
42. What is a Stochastic Policy?
A. A policy that always chooses the same action for a given state
B. A policy where actions are selected based on probabilities
C. A policy that ignores the state
D. A policy used only in deterministic environments
Correct Answer: A policy where actions are selected based on probabilities
Explanation: A stochastic policy defines $\pi(a \mid s) = P(A_t = a \mid S_t = s)$, allowing for randomness in action selection.
43. In TD Learning, the term $R_{t+1} + \gamma V(S_{t+1})$ is known as:
A. The TD Error
B. The TD Target
C. The Return
D. The Baseline
Correct Answer: The TD Target
Explanation: The update moves the current estimate towards this target value.
44. Which of the following is NOT a component of the RL Agent-Environment interface?
A. Action
B. State
C. Reward
D. Supervised Label
Correct Answer: Supervised Label
Explanation: RL relies on rewards generated by the environment, not external supervised labels.
45. If an agent always chooses the action with the highest estimated value, it is acting:
A. Stochastically
B. Greedily
C. Randomly
D. Optimally (always guaranteed)
Correct Answer: Greedily
Explanation: Greedy action selection exploits current knowledge. Note that acting greedily with respect to imperfect knowledge is not necessarily acting optimally.
46. What is the relationship between $V^*(s)$ and $Q^*(s, a)$?
A. $V^*(s) = \min_a Q^*(s, a)$
B. $V^*(s) = \max_a Q^*(s, a)$
C. $V^*(s) = \sum_a Q^*(s, a)$
D. $V^*(s) = Q^*(s, a)$ for every action $a$
Correct Answer: $V^*(s) = \max_a Q^*(s, a)$
Explanation: The value of a state under the optimal policy is equal to the value of the best action available in that state.
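In tabular form this relationship is just a row-wise maximum; a tiny illustrative check (the numbers are invented):

```python
import numpy as np

# For a tabular Q* (rows = states, columns = actions), V* is the row-wise maximum.
Q_star = np.array([[1.0, 3.0],
                   [0.5, 0.2]])
V_star = Q_star.max(axis=1)      # V*(s) = max_a Q*(s, a)
print(V_star)                    # [3.  0.5]
```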
47. Which represents a purely delayed reward scenario?
A. Getting a point for every correct step
B. Winning a game of Chess after many moves
C. A thermostat adjusting every minute
D. Receiving a salary every day
Correct Answer: Winning a game of Chess after many moves
Explanation: In Chess, rewards (win/loss) are typically only received at the very end of the game, making credit assignment difficult.
48. In the equation $Q \leftarrow (1 - \alpha) Q + \alpha \cdot \text{Target}$, what does this represent?
A. A weighted average between the old estimate and the new information
B. A complete replacement of the old value
C. A sum of all past rewards
D. The probability of the action
Correct Answer: A weighted average between the old estimate and the new information
Explanation: The update rule is a form of exponential moving average, smoothing the estimate over time.
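A quick numeric illustration of that smoothing behaviour (the numbers are invented):

```python
# The incremental form  estimate += alpha * (target - estimate)
# is algebraically identical to  (1 - alpha) * estimate + alpha * target.
alpha, estimate = 0.1, 0.0
for target in [10.0, 10.0, 10.0, 10.0]:
    estimate = (1 - alpha) * estimate + alpha * target
print(estimate)   # 3.439: the estimate moves gradually toward 10 rather than jumping there
```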
49. What happens if the exploration rate $\epsilon$ in $\epsilon$-greedy is set to 1?
A. The agent acts purely greedily
B. The agent acts completely randomly
C. The agent stops learning
D. The agent alternates actions
Correct Answer: The agent acts completely randomly
Explanation: If $\epsilon = 1$, the agent chooses a random action 100% of the time.
50. Generally, how does TD learning compare to Monte Carlo in terms of variance?
A. TD has higher variance
B. TD has lower variance
C. They have the same variance
D. Variance is not a factor in RL
Correct Answer: TD has lower variance
Explanation: Because TD updates are based on one step (or a few steps), they are less affected by the randomness of the entire remaining trajectory than MC updates, leading to lower variance (though potentially higher bias).