1. What is the primary goal of an agent in Reinforcement Learning?
A. To classify data into distinct categories
B. To maximize the cumulative reward over time
C. To minimize the error in prediction
D. To cluster similar data points together
Correct Answer: To maximize the cumulative reward over time
Explanation: In Reinforcement Learning, the agent interacts with an environment and aims to learn a policy that maximizes the total amount of reward it receives over the long run.
2. Which of the following elements is NOT a core component of a Reinforcement Learning system?
A. Agent
B. Environment
C. Supervisor Labels
D. Reward Signal
Correct Answer: Supervisor Labels
Explanation: Reinforcement Learning is distinct from Supervised Learning because it relies on a reward signal from the environment rather than labeled input/output pairs provided by a supervisor.
3. In the context of RL, what does the 'Markov Property' imply about the state?
A. The future depends on the past history of all states.
B. The future depends only on the current state and action, not the history.
C. The state is independent of the actions taken.
D. The state transition is always deterministic.
Correct Answer: The future depends only on the current state and action, not the history.
Explanation: A state signal has the Markov property if the current state captures all relevant information from the history, making the future independent of the past given the present.
4. What does a 'Policy' represent in Reinforcement Learning?
A. The probability of moving from one state to another
B. The immediate reward received after an action
C. A mapping from perceived states to actions to be taken
D. The calculation of total future reward
Correct Answer: A mapping from perceived states to actions to be taken
Explanation: A policy defines the agent's behavior. It maps states of the environment to the actions the agent should take when in those states.
5. In an MDP, what does the discount factor (gamma, γ) determine?
A. The probability of choosing a random action
B. The importance of future rewards relative to immediate rewards
C. The learning rate of the algorithm
D. The magnitude of the transition probability
Correct Answer: The importance of future rewards relative to immediate rewards
Explanation: The discount factor (0 ≤ γ ≤ 1) determines the present value of future rewards. A value close to 0 makes the agent shortsighted, while a value close to 1 makes it farsighted.
6. Which tuple represents a finite Markov Decision Process (MDP)?
A. (S, A, P, R)
B. (S, A, P, R, γ)
C. (S, P, R, γ)
D. (S, A, R, γ)
Correct Answer: (S, A, P, R, γ)
Explanation: An MDP is typically defined by a tuple containing States (S), Actions (A), Transition Probabilities (P), Reward Function (R), and a Discount Factor (γ).
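For illustration only, here is a minimal Python sketch of how such a tuple might be represented in code; the class and field names are assumptions, not part of the question:

    from typing import NamedTuple

    class FiniteMDP(NamedTuple):
        # Illustrative container for the tuple (S, A, P, R, γ); names are hypothetical.
        states: list          # S: finite set of states
        actions: list         # A: finite set of actions
        transitions: dict     # P: maps (s, a) to {s': probability}
        rewards: dict         # R: maps (s, a, s') to expected reward
        gamma: float          # γ: discount factor in [0, 1]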
7. What is the difference between a Value Function V(s) and an Action-Value Function Q(s, a)?
A. V(s) includes the action taken, while Q(s, a) does not.
B. V(s) estimates the return of a state, while Q(s, a) estimates the return of taking an action in a state.
C. V(s) is for continuous spaces, Q(s, a) is for discrete spaces.
D. There is no mathematical difference.
Correct Answer: V(s) estimates the return of a state, while Q(s, a) estimates the return of taking an action in a state.
Explanation: V(s) (State-Value) predicts the expected return starting from state s. Q(s, a) (Action-Value) predicts the expected return starting from state s, taking action a.
8. The Bellman Equation expresses the relationship between:
A. The value of a state and the values of its successor states
B. The agent and the environment
C. The exploration rate and the exploitation rate
D. The current reward and the previous reward
Correct Answer: The value of a state and the values of its successor states
Explanation: The Bellman Equation is a recursive definition that decomposes the value function into the immediate reward plus the discounted value of the next state.
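As a rough illustration of this recursion, the sketch below (assumed data structures, not part of the quiz) performs one Bellman expectation backup for a single state under a fixed policy:

    # One Bellman expectation backup for state s under policy pi:
    # V(s) = sum_a pi(a|s) * sum_{s', r} p(s', r | s, a) * (r + gamma * V(s'))
    def bellman_backup(s, policy, dynamics, V, gamma=0.9):
        # policy[s][a] gives pi(a|s); dynamics[(s, a)] is a list of (prob, s_next, reward).
        value = 0.0
        for a, pi_a in policy[s].items():
            for prob, s_next, r in dynamics[(s, a)]:
                value += pi_a * prob * (r + gamma * V[s_next])
        return value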
9. Which method requires the completion of an entire episode before updating the value estimates?
A. Temporal Difference Learning
B. Dynamic Programming
C. Monte Carlo Learning
D. Q-Learning
Correct Answer: Monte Carlo Learning
Explanation: Monte Carlo methods wait until the end of an episode to calculate the actual return (sum of rewards) and then update the value estimates based on that return.
10. What is 'Bootstrapping' in the context of Temporal Difference (TD) learning?
A. Restarting the learning process from scratch
B. Updating an estimate based on another estimate
C. Using random weights for initialization
D. Running multiple episodes in parallel
Correct Answer: Updating an estimate based on another estimate
Explanation: TD learning bootstraps because it updates its guess for the value of a state based on its guess for the value of the next state, rather than waiting for the final actual return.
11. In the Exploration vs. Exploitation trade-off, what does 'Exploitation' refer to?
A. Trying new actions to find better rewards
B. Choosing the action currently believed to be the best
C. Randomly selecting actions
D. Ignoring the reward signal
Correct Answer: Choosing the action currently believed to be the best
Explanation: Exploitation involves maximizing current expected reward by selecting the action that the agent already knows yields the highest value.
12. What is the Epsilon-Greedy strategy?
A. Always choosing the best action
B. Always choosing a random action
C. Choosing the best action most of the time, but a random action with probability epsilon
D. Choosing the action with the lowest value
Correct Answer: Choosing the best action most of the time, but a random action with probability epsilon
Explanation: Epsilon-Greedy balances exploration and exploitation. With probability 1-epsilon it exploits (greedy), and with probability epsilon it explores (random action).
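A minimal Python sketch of epsilon-greedy action selection over a table of Q-values (the dictionary layout is an assumption for illustration):

    import random

    def epsilon_greedy(Q, state, epsilon=0.1):
        # Q[state][action] holds the current action-value estimates.
        if random.random() < epsilon:                    # explore with probability epsilon
            return random.choice(list(Q[state].keys()))
        return max(Q[state], key=Q[state].get)           # exploit (greedy) with probability 1 - epsilon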
13. Which equation represents the Bellman Optimality Equation for V*(s)?
Correct Answer: V*(s) = max_a Σ_{s', r} p(s', r | s, a) [r + γV*(s')]
Explanation: The Bellman Optimality Equation states that the value of a state under an optimal policy must equal the expected return for the best action from that state.
14. In Monte Carlo learning, what is the difference between 'First-visit' and 'Every-visit' MC?
A. First-visit is faster; Every-visit is slower.
B. First-visit updates only the first time a state is visited in an episode; Every-visit updates for all visits.
C. First-visit uses bootstrapping; Every-visit does not.
D. Every-visit is for continuous tasks; First-visit is for episodic tasks.
Correct Answer: First-visit updates only the first time a state is visited in an episode; Every-visit updates for all visits.
Explanation: In an episode, a state might be visited multiple times. First-visit MC averages returns only from the first occurrence, while Every-visit MC averages returns following all visits to that state.
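A rough sketch of a first-visit MC update for one finished episode (data layout assumed for illustration; every-visit MC would simply skip the first-occurrence check):

    def first_visit_mc_update(episode, V, returns_sum, returns_count, gamma=0.9):
        # episode is a list of (state, reward) pairs, where reward is R_{t+1}.
        # Backwards pass: compute the return G_t following every time step.
        G, returns = 0.0, [0.0] * len(episode)
        for t in reversed(range(len(episode))):
            _, r = episode[t]
            G = r + gamma * G
            returns[t] = G
        # Update only at the first occurrence of each state in the episode.
        seen = set()
        for t, (s, _) in enumerate(episode):
            if s in seen:
                continue
            seen.add(s)
            returns_sum[s] = returns_sum.get(s, 0.0) + returns[t]
            returns_count[s] = returns_count.get(s, 0) + 1
            V[s] = returns_sum[s] / returns_count[s]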
15. Which equation represents the TD(0) update rule?
Correct Answer: V(s) ← V(s) + α[R + γV(s') − V(s)]
Explanation: The TD(0) update moves the current value estimate V(s) toward the TD target (R + γV(s')), where alpha is the learning rate.
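A minimal sketch of this update for a single observed transition (dictionary-based value table assumed for illustration):

    def td0_update(V, s, r, s_next, alpha=0.1, gamma=0.9, terminal=False):
        # V(s) <- V(s) + alpha * [R + gamma * V(s') - V(s)]
        target = r + (0.0 if terminal else gamma * V.get(s_next, 0.0))
        V[s] = V.get(s, 0.0) + alpha * (target - V.get(s, 0.0))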
16. Which of the following describes a 'Model-Free' RL approach?
A. The agent learns the transition probabilities and reward function explicitly.
B. The agent plans by simulating future states.
C. The agent learns a policy or value function directly from experience without knowing the environment's dynamics.
D. The agent requires a supervisor to model the environment.
Correct Answer: The agent learns a policy or value function directly from experience without knowing the environment's dynamics.
Explanation: Model-free methods (like MC and TD) do not require a model of the environment (transition probabilities P and rewards R) to learn; they rely on trial-and-error experience.
17. What is the return (G_t) in Reinforcement Learning?
A. The immediate reward received
B. The total discounted sum of future rewards
C. The average reward of the episode
D. The final reward at the terminal state
Correct Answer: The total discounted sum of future rewards
Explanation: The return G_t is the quantity the agent seeks to maximize, calculated as the sum of discounted rewards from time step t until the end of the episode.
18. If the discount factor γ is 0, the agent is:
A. Myopic (cares only about immediate reward)
B. Far-sighted (cares only about long-term reward)
C. Random
D. Optimal
Correct Answer: Myopic (cares only about immediate reward)
Explanation: If γ = 0, the return G_t equals the immediate reward R_{t+1}, meaning the agent ignores all future consequences and focuses solely on the current step.
19. What is an 'Episodic Task'?
A. A task that continues indefinitely without end
B. A task that breaks interaction into subsequences called episodes which end in a terminal state
C. A task where the reward is always zero
D. A task with only one state
Correct Answer: A task that breaks interaction into subsequences called episodes which end in a terminal state
Explanation: Episodic tasks have a defined starting point and a terminal state (like a game of chess). Once the terminal state is reached, the episode ends and the system resets.
20. Comparing MC and TD methods, which statement is true regarding variance and bias?
A. MC has low variance, high bias.
B. TD has high variance, low bias.
C. MC has high variance, zero bias; TD has low variance, some bias.
D. MC and TD have identical variance and bias properties.
Correct Answer: MC has high variance, zero bias; TD has low variance, some bias.
Explanation: MC is unbiased because it uses actual returns, but has high variance due to stochasticity over a full episode. TD introduces bias via bootstrapping but has lower variance.
21. In the context of the Bellman Equation, what is p(s', r | s, a)?
A. The policy function
B. The value function
C. The dynamics function (probability of next state and reward)
D. The discount factor
Correct Answer: The dynamics function (probability of next state and reward)
Explanation: This term represents the joint probability of transitioning to state s' and receiving reward r, given that the agent takes action a in state s.
22. What is a 'Deterministic Policy'?
A. A policy that maps a state to a probability distribution over actions
B. A policy that maps a state to a specific, single action
C. A policy that changes over time
D. A policy that ignores the state
Correct Answer: A policy that maps a state to a specific, single action
Explanation: A deterministic policy specifies that for every state s there is exactly one action a = π(s) that the agent will take.
23. Which learning method performs updates step-by-step without waiting for the episode to end?
A. Monte Carlo
B. Temporal Difference
C. Exhaustive Search
D. Batch Learning
Correct Answer: Temporal Difference
Explanation: Temporal Difference (TD) learning updates estimates after every time step (or n steps) using the immediate reward and the estimate of the next state.
24. The term 'Greedy Action' implies:
A. Selecting the action with the highest estimated value
B. Selecting the action with the lowest cost
C. Selecting a random action
D. Selecting an action that maximizes exploration
Correct Answer: Selecting the action with the highest estimated value
Explanation: A greedy action is one that exploits current knowledge by choosing the action associated with the maximum Q-value for the current state.
25. What is the role of the Value Function?
A. To define the rules of the environment
B. To predict how good it is to be in a specific state
C. To generate random numbers
D. To store the immediate reward
Correct Answer: To predict how good it is to be in a specific state
Explanation: Value functions quantify the 'goodness' of a state (or state-action pair), defined as the expected future reward achievable from that state.
26. Which of the following is NOT a challenge in Reinforcement Learning?
A. Exploration vs Exploitation
B. Delayed Reward
C. Credit Assignment Problem
D. Availability of labeled training data
Correct Answer: Availability of labeled training data
Explanation: Labeled training data is a requirement for Supervised Learning, not RL. RL agents learn from scalar reward signals, not correct answer labels.
27. In the equation G_t = R_{t+1} + γR_{t+2} + γ²R_{t+3} + ..., what is G_t?
A. The value function
B. The discounted return
C. The policy
D. The transition probability
Correct Answer: The discounted return
Explanation: G_t represents the sum of discounted future rewards starting from time t, known as the return.
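For a concrete feel, the short sketch below computes G_t from a list of future rewards; for example, rewards [1, 0, 2] with γ = 0.9 give 1 + 0.9·0 + 0.81·2 = 2.62:

    def discounted_return(rewards, gamma=0.9):
        # rewards = [R_{t+1}, R_{t+2}, ...]; accumulate backwards so each step applies one factor of gamma.
        G = 0.0
        for r in reversed(rewards):
            G = r + gamma * G
        return G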
28. Why is exploration necessary in Reinforcement Learning?
A. To avoid overfitting
B. To discover states and actions that might yield higher rewards than the current best known options
C. To speed up the calculation of the Bellman equation
D. To minimize the discount factor
Correct Answer: To discover states and actions that might yield higher rewards than the current best known options
Explanation: If an agent only exploits, it may get stuck in a suboptimal solution. Exploration ensures the agent gathers enough information about the environment to find the global optimum.
29. Which of the following is an Off-Policy control method?
A. SARSA
B. Q-Learning
C. Monte Carlo Policy Evaluation
D. Standard TD Prediction
Correct Answer: Q-Learning
Explanation: Q-Learning is off-policy because it learns the value of the optimal policy (using max over actions) independently of the agent's actual actions (which might be epsilon-greedy).
30. What is the 'Credit Assignment Problem' in RL?
A. Determining which past action is responsible for a current reward
B. Assigning monetary value to states
C. Calculating the computational cost of the algorithm
D. Deciding how much memory to allocate
Correct Answer: Determining which past action is responsible for a current reward
Explanation: Because rewards are often delayed, it is difficult to determine exactly which action in a long sequence caused the positive or negative outcome.
31. An optimal policy π* is defined as:
A. A policy that is better than or equal to all other policies
B. A policy that reaches the terminal state fastest
C. A policy that explores every state
D. A policy with zero discount factor
Correct Answer: A policy that is better than or equal to all other policies
Explanation: A policy π* is optimal if its expected return V_{π*}(s) is greater than or equal to V_π(s) for all states s and all other policies π.
32. In Q-Learning, the target value for the update is:
A. R + γ Q(s', a')
B. R + γ max_a' Q(s', a')
C. The actual return G_t
D. V(s')
Correct Answer: R + γ max_a' Q(s', a')
Explanation: Q-Learning updates the Q-value towards the immediate reward plus the discounted value of the best possible action in the next state (greedy target).
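A minimal sketch of the resulting update rule Q(s, a) ← Q(s, a) + α[R + γ max_a' Q(s', a') − Q(s, a)] (nested-dictionary Q-table assumed for illustration):

    def q_learning_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.9, terminal=False):
        # Off-policy target: the best action in the next state, regardless of what the agent actually does next.
        best_next = 0.0 if terminal else max(Q[s_next].values())
        Q[s][a] += alpha * (r + gamma * best_next - Q[s][a])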
33. Monte Carlo methods are applicable only to:
A. Continuous tasks
B. Episodic tasks
C. Tasks with known models
D. Deterministic environments
Correct Answer: Episodic tasks
Explanation: Monte Carlo methods require the episode to terminate so that the final return can be calculated and used for updates.
34. What does SARSA stand for?
A. State-Action-Reward-State-Action
B. State-Action-Return-State-Average
C. System-Action-Reward-System-Action
D. Search-And-Retrieve-Sorted-Arrays
Correct Answer: State-Action-Reward-State-Action
Explanation: SARSA represents the quintuple (S_t, A_t, R_{t+1}, S_{t+1}, A_{t+1}) used in the update rule for this on-policy TD control method.
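A minimal sketch of the SARSA update (same assumed Q-table layout as the Q-Learning sketch above); note that it uses the action a' actually selected by the policy instead of a max:

    def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.9, terminal=False):
        # On-policy target: the value of the action the current policy actually chose in s'.
        next_value = 0.0 if terminal else Q[s_next][a_next]
        Q[s][a] += alpha * (r + gamma * next_value - Q[s][a])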
35. When does the 'Optimistic Initial Values' technique encourage exploration?
A. When initial value estimates are set very low
B. When initial value estimates are set higher than the expected maximum reward
C. When epsilon is set to 0
D. When the discount factor is 1
Correct Answer: When initial value estimates are set higher than the expected maximum reward
Explanation: If values start high, the agent is 'disappointed' by actual rewards (which are lower) and tries other actions to find the supposed high rewards, thus exploring.
36. Which Bellman equation is linear?
A. Bellman Expectation Equation
B. Bellman Optimality Equation
C. Both
D. Neither
Correct Answer: Bellman Expectation Equation
Explanation: The Bellman Expectation Equation for a fixed policy is a system of linear equations. The Optimality Equation contains a 'max' operator, making it non-linear.
37. What is the main advantage of TD learning over Monte Carlo?
A. It is unbiased
B. It can learn online during an episode
C. It works better for non-Markov environments
D. It requires less memory
Correct Answer: It can learn online during an episode
Explanation: TD learning can update values after every step, making it suitable for continuous tasks or very long episodes, whereas MC must wait for the episode to end.
38. The sequence of states and actions S_0, A_0, R_1, S_1, A_1, R_2, ... is called:
A. A Policy
B. A Trajectory
C. A Model
D. A Value Function
Correct Answer: A Trajectory
Explanation: A trajectory (or history) is the sequence of states, actions, and rewards encountered by the agent as it interacts with the environment.
39. In a stochastic environment:
A. Taking an action always leads to the same next state
B. Taking an action leads to a next state based on a probability distribution
C. Rewards are not provided
D. The agent cannot learn
Correct Answer: Taking an action leads to a next state based on a probability distribution
Explanation: Stochasticity means there is randomness in the transitions; doing action A in state S does not guarantee arriving at state S'.
40. Which algorithm is considered 'On-Policy'?
A. Q-Learning
B. SARSA
C. Max-Q
D. Off-Policy MC
Correct Answer: SARSA
Explanation: SARSA is on-policy because it updates the Q-values based on the action actually taken by the current policy (including exploratory steps).
41. The quantity R + γV(s') is often called the:
A. TD Error
B. TD Target
C. Monte Carlo Return
D. Exploration Bonus
Correct Answer: TD Target
Explanation: In TD learning, the estimate is updated towards this value, which acts as the target for the prediction.
42. If an agent uses a pure Greedy strategy (epsilon = 0), it:
A. Never explores
B. Explores randomly
C. Explores 50% of the time
D. Alternates between exploration and exploitation
Correct Answer: Never explores
Explanation: A pure greedy strategy always picks the current best-known action, never trying new actions to see if they are better (zero exploration).
43. The State-Value function V_π(s) is the expected return starting from state s and then following:
A. The optimal policy
B. Policy π
C. A random policy
D. The greedy policy
Correct Answer: Policy π
Explanation: V_π(s) specifically evaluates the expected return if the agent behaves according to the specific policy π.
44. Dynamic Programming (DP) methods in RL assume:
A. A perfect model of the environment is available
B. The environment is unknown
C. Monte Carlo sampling is used
D. Rewards are always positive
Correct Answer: A perfect model of the environment is available
Explanation: DP algorithms (like Policy Iteration and Value Iteration) require knowledge of the transition probabilities and reward functions (the model) to compute values.
45. What is 'Policy Improvement'?
A. Calculating the value function for a policy
B. Making a new policy that is greedy with respect to the current value function
C. Increasing the learning rate
D. Collecting more data
Correct Answer: Making a new policy that is greedy with respect to the current value function
Explanation: Policy Improvement generates a better policy π' by acting greedily with respect to the value function of the current policy π.
46. Upper Confidence Bound (UCB) is an algorithm used to handle:
A. The Exploration-Exploitation Dilemma
B. The Bellman Equation
C. Discount Factors
D. Continuous State Spaces
Correct Answer: The Exploration-Exploitation Dilemma
Explanation: UCB selects actions based on their estimated value plus a confidence interval term, encouraging exploration of actions with uncertain values.
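A rough sketch of UCB1-style action selection (the exploration constant c and the data layout are assumptions for illustration):

    import math

    def ucb_select(values, counts, t, c=2.0):
        # values[a]: estimated value of action a; counts[a]: times a was tried; t: total selections so far.
        for a in values:
            if counts[a] == 0:        # try every action at least once
                return a
        return max(values,
                   key=lambda a: values[a] + c * math.sqrt(math.log(t) / counts[a]))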
47. In the TD error equation δ = R + γV(s') - V(s), what does δ represent?
A. The difference between the target and the current estimate
B. The total return
C. The probability of the next state
D. The learning rate
Correct Answer: The difference between the target and the current estimate
Explanation: The TD error (delta) measures the surprise or difference between the improved estimate (target) and the current estimate.
48. A key distinction between Reinforcement Learning and Unsupervised Learning is:
A. RL uses labeled data
B. RL maximizes a reward signal, Unsupervised Learning finds hidden structure
Correct Answer: RL maximizes a reward signal, Unsupervised Learning finds hidden structure
Explanation: RL is driven by maximizing rewards, whereas Unsupervised Learning focuses on finding patterns or structure in unlabeled data without a reward signal.
49. Which of the following creates a 'Continuous Task'?
Correct Answer: An automated stock trading agent operating indefinitely
Explanation: Continuous tasks do not have a natural end point; the agent operates continuously (gamma usually must be < 1 to keep returns finite).
50. Policy Iteration consists of two alternating steps:
A. Policy Evaluation and Policy Improvement
B. Exploration and Exploitation
C. Monte Carlo and TD
D. Prediction and Control
Correct Answer: Policy Evaluation and Policy Improvement
Explanation: Policy Iteration finds the optimal policy by repeatedly evaluating the current policy (calculating V) and then improving it (making it greedy).
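A compact sketch of the two alternating steps (a model given as dynamics[(s, a)] → list of (prob, s', r) is assumed, since Policy Iteration is a Dynamic Programming method that needs the model):

    def policy_iteration(states, actions, dynamics, gamma=0.9, theta=1e-6):
        policy = {s: actions[0] for s in states}
        V = {s: 0.0 for s in states}
        while True:
            # Policy Evaluation: compute V for the current policy until it stabilises.
            while True:
                delta = 0.0
                for s in states:
                    v = sum(p * (r + gamma * V[s2]) for p, s2, r in dynamics[(s, policy[s])])
                    delta = max(delta, abs(v - V[s]))
                    V[s] = v
                if delta < theta:
                    break
            # Policy Improvement: act greedily with respect to the new value function.
            stable = True
            for s in states:
                best = max(actions,
                           key=lambda a: sum(p * (r + gamma * V[s2]) for p, s2, r in dynamics[(s, a)]))
                if best != policy[s]:
                    policy[s], stable = best, False
            if stable:
                return policy, V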