Unit 5 - Practice Quiz

INT394 50 Questions

1 What is the primary goal of an agent in Reinforcement Learning?

A. To minimize the reconstruction error of the input data
B. To maximize the cumulative reward over time
C. To classify data into distinct categories based on labeled examples
D. To find hidden structures in unlabeled data

2 Which of the following tuple representations correctly defines a Markov Decision Process (MDP)?

A.
B.
C.
D.
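
For reference, the standard definition (as in Sutton & Barto) is the tuple (S, A, P, R, γ), where S is the set of states, A the set of actions, P the state-transition probability function, R the reward function, and γ the discount factor.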

3 What does the Markov Property imply about the state of an environment?

A. The current state provides no information about the future
B. The future depends on the entire history of past states
C. The future is independent of the past given the present
D. The transition probabilities change over time

4 In the context of RL, what does the discount factor (gamma) control?

A. The learning rate of the agent
B. The probability of transitioning to a random state
C. The importance of immediate rewards versus future rewards
D. The exploration rate of the agent

5 What distinguishes Reinforcement Learning from Supervised Learning?

A. RL is only used for continuous value prediction
B. RL maps inputs to outputs without any feedback
C. RL learns from interaction and delayed feedback (rewards) rather than explicit labels
D. RL relies on a static dataset with labeled targets

6 What is a Policy (π) in Reinforcement Learning?

A. The mechanism that provides rewards to the agent
B. The numerical value indicating the goodness of a state
C. A function that predicts the next state given the current state
D. A mapping from states to actions (or probabilities of actions)

7 Which equation represents the total discounted return G_t?

A.
B.
C.
D.
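
For reference, the return is standardly defined as G_t = R_{t+1} + γ·R_{t+2} + γ²·R_{t+3} + … = Σ_{k=0}^{∞} γ^k · R_{t+k+1}. A minimal Python sketch of this sum over a finite episode (the function name and reward list are illustrative assumptions):

def discounted_return(rewards, gamma=0.9):
    """Compute G_t = sum of gamma^k * R_{t+k+1} over a finite episode."""
    g = 0.0
    for r in reversed(rewards):  # fold from the end: g = r + gamma * g
        g = r + gamma * g
    return g

print(discounted_return([1.0, 0.0, 2.0], gamma=0.5))  # 1 + 0.5*0 + 0.25*2 = 1.5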

8 What does the State-Value Function V^π(s) represent?

A. The probability of moving to state s′
B. The expected return starting from state s and following policy π
C. The maximum reward possible in the entire environment
D. The immediate reward received at state s

9 What is the Action-Value Function Q^π(s, a)?

A. The value of taking action a in state s and then following policy π
B. The probability of taking action a in state s
C. The reward received immediately after taking action a
D. The value of being in state s regardless of the action taken

10 The Bellman Equation expresses a relationship between:

A. The policy and the reward function only
B. The learning rate and the discount factor
C. The value of a state and the value of its successor states
D. The current observation and the previous observation

11 In the Bellman Optimality Equation, which operator is used to define the optimal value?

A. Average
B. Min (Minimization over costs)
C. Max (Maximization over actions)
D. Summation over time

12 What is the Exploration vs. Exploitation trade-off?

A. Deciding whether to use a neural network or a tabular method
B. Balancing between gathering new information and using known information to maximize reward
C. Choosing between model-based and model-free learning
D. Trading off computation time for memory usage

13 Which method is commonly used to balance exploration and exploitation?

A. ε-greedy (Epsilon-greedy)
B. Gradient Descent
C. Backpropagation
D. Principal Component Analysis
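
For example, ε-greedy action selection can be sketched in a few lines of Python (the list-of-values Q row and the names are illustrative assumptions):

import random

def epsilon_greedy(q_values, epsilon=0.1):
    """With probability epsilon explore; otherwise exploit the best action."""
    if random.random() < epsilon:
        return random.randrange(len(q_values))                    # explore
    return max(range(len(q_values)), key=lambda a: q_values[a])   # exploit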

14 What does it mean for an RL algorithm to be Model-Free?

A. It builds an explicit model of the environment's transition dynamics
B. It does not require knowledge of the transition probability or reward function
C. It cannot solve MDPs
D. It does not use any value functions

15 What is Temporal Difference (TD) Learning?

A. A method that waits until the end of an episode to update values
B. A method that updates estimates based on other learned estimates without waiting for the outcome
C. A supervised learning technique applied to RL
D. A method that requires a complete model of the environment

16 Which of the following is the TD(0) update rule for the state-value function V(s)?

A.
B.
C.
D.
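
For reference, the TD(0) update is V(s) ← V(s) + α·[r + γ·V(s′) − V(s)]. A minimal sketch, assuming V is a Python dict of state values (names are illustrative):

def td0_update(V, s, r, s_next, alpha=0.1, gamma=0.99):
    """One TD(0) step: move V[s] toward the bootstrapped target r + gamma*V[s']."""
    td_error = r + gamma * V[s_next] - V[s]
    V[s] += alpha * td_error
    return td_error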

17 What is Bootstrapping in the context of TD learning?

A. Resampling the dataset to create more training data
B. Updating a value estimate using another estimated value
C. Restarting the episode when the agent gets stuck
D. Initializing weights to zero

18 Q-Learning is considered an Off-Policy algorithm. What does this mean?

A. It must follow the exact policy it is trying to learn
B. It requires the environment to be turned off during updates
C. It does not use a policy at all
D. It learns the value of the optimal policy while following a different exploratory policy

19 Which represents the Q-Learning update equation?

A.
B.
C.
D.
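
For reference, the Q-Learning update is Q(s, a) ← Q(s, a) + α·[r + γ·max_{a′} Q(s′, a′) − Q(s, a)]. A minimal sketch, assuming a NumPy Q-table indexed as Q[state, action] (names are illustrative):

import numpy as np

def q_learning_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.99):
    """Off-policy update: the target uses the best next action, not the one taken."""
    target = r + gamma * np.max(Q[s_next])
    Q[s, a] += alpha * (target - Q[s, a])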

20 In the Q-Learning update rule, what is α?

A. Discount factor
B. Learning rate
C. Exploration probability
D. Reward function

21 If γ = 0, the agent is:

A. Optimal
B. Myopic (short-sighted)
C. Random
D. Infinitely far-sighted

22 What is the key difference between Monte Carlo (MC) methods and TD Learning?

A. TD requires a model of the environment
B. MC is biased while TD is unbiased
C. MC can only be used for continuous states
D. MC updates are performed only after a complete episode, while TD updates can happen at every step

23 Which of the following best describes the Credit Assignment Problem in RL?

A. Assigning memory to store the Q-table
B. Determining which past action is responsible for a current reward
C. Calculating the computational cost of the algorithm
D. Distributing rewards among multiple agents

24 In a tabular Q-learning approach, the Q-table has dimensions of:

A. Number of Actions × Number of Rewards
B. Number of States × Number of Actions
C. Number of Episodes × Time Steps
D. Number of States × Number of States
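
For example, such a table is commonly allocated as a 2-D array (the gridworld sizes here are illustrative assumptions):

import numpy as np

n_states, n_actions = 16, 4          # e.g. a 4x4 gridworld with 4 moves
Q = np.zeros((n_states, n_actions))  # one row per state, one column per action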

25 What is an Episodic Task?

A. A task where the environment changes randomly
B. A task that requires supervised training data
C. A task that continues forever without limit
D. A task with a well-defined starting and ending point (terminal state)

26 What is a Continuing Task?

A. A task where rewards are always zero
B. A task solvable only by Monte Carlo methods
C. A task that naturally breaks into episodes
D. A task that goes on forever without a terminal state

27 The Bellman Expectation Equation for V^π(s) can be written as:

A.
B.
C.
D.
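
For reference, one common form of the Bellman Expectation Equation is
V^π(s) = Σ_a π(a|s) · Σ_{s′} P(s′|s, a) · [ R(s, a, s′) + γ·V^π(s′) ].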

28 What is the TD Error (δ)?

A. The difference between two consecutive rewards
B. The difference between the predicted value and the actual target value
C. The probability of taking a wrong action
D. The error in the reward function

29 Which algorithm is known as "on-policy" TD control?

A. Monte Carlo
B. SARSA
C. Q-Learning
D. Value Iteration

30 The SARSA update rule is given by:

A.
B.
C.
D.
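
For reference, the SARSA update is Q(s, a) ← Q(s, a) + α·[r + γ·Q(s′, a′) − Q(s, a)], where a′ is the action actually selected in s′. A minimal sketch, assuming the same Q[state, action] table as above (names are illustrative):

def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.99):
    """On-policy update: the target uses the action a_next the policy really took."""
    target = r + gamma * Q[s_next, a_next]
    Q[s, a] += alpha * (target - Q[s, a])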

31 If a problem has a continuous state space, which challenge arises for tabular Q-learning?

A. The discount factor must be 1
B. The Curse of Dimensionality (table becomes too large)
C. The rewards cannot be calculated
D. The Markov property no longer holds

32 A Deterministic Policy maps:

A. State to a probability distribution over actions
B. Action to a state
C. State to a reward value
D. State to a single action

33 The transition probability P(s′ | s, a) represents:

A. The probability of taking action a in state s
B. The probability of receiving a reward in state s
C. The value of state s
D. The probability of transitioning to state s′ given state s and action a

34 Which of the following guarantees the convergence of Q-learning to the optimal Q*?

A. If the policy is strictly greedy
B. If the discount factor is exactly 1
C. If the environment is deterministic only
D. If all state-action pairs are visited infinitely often and the learning rate decays appropriately

35 What is the value of a Terminal State in an episodic task?

A. The last received reward
B. 1
C. 0
D. Infinity

36 What is the Prediction Problem in RL?

A. Predicting the next state
B. Finding the optimal policy
C. Predicting the immediate reward
D. Estimating the value function for a given policy

37 What is the Control Problem in RL?

A. Controlling the environment parameters
B. Ensuring the agent does not crash
C. Finding the optimal policy that maximizes return
D. Estimating the value of a fixed policy

38 In the context of the Bellman Equation, what does the term 'Recursive' mean?

A. The function calls itself
B. The function is linear
C. The function is undefined
D. The function depends on the previous time step only

39 Which of the following is a model-based algorithm?

A. SARSA
B. Dynamic Programming (Policy Iteration)
C. Monte Carlo
D. Q-Learning

40 Why do we use the max operator in Q-Learning?

A. To ensure the agent explores
B. To estimate the value of the best possible future action
C. To minimize the error
D. To calculate the average reward

41 In an MDP, if the state space S is finite, the action space A is finite, and the dynamics are known, which technique can solve for the optimal policy exactly?

A. Random Search
B. Linear Regression
C. Dynamic Programming
D. Clustering

42 What is a Stochastic Policy?

A. A policy that always chooses the same action for a given state
B. A policy that ignores the state
C. A policy used only in deterministic environments
D. A policy where actions are selected based on probabilities

43 In TD Learning, the term r + γ·V(s′) is known as:

A. The TD Error
B. The TD Target
C. The Return
D. The Baseline

44 Which of the following is NOT a component of the RL Agent-Environment interface?

A. Supervised Label
B. Action
C. Reward
D. State

45 If an agent always chooses the action with the highest estimated value, it is acting:

A. Greedily
B. Randomly
C. Stochastically
D. Optimally (always guaranteed)

46 What is the relationship between V^π(s) and Q^π(s, a)?

A.
B.
C.
D.
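
For reference, the standard identity is V^π(s) = Σ_a π(a|s) · Q^π(s, a); for a deterministic policy this reduces to V^π(s) = Q^π(s, π(s)).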

47 Which represents a purely delayed reward scenario?

A. Winning a game of Chess after many moves
B. A thermostat adjusting every minute
C. Getting a point for every correct step
D. Receiving a salary every day

48 In the equation Q(s, a) ← (1 − α)·Q(s, a) + α·[r + γ·max_{a′} Q(s′, a′)], what does this represent?

A. The probability of the action
B. A weighted average between the old estimate and the new information
C. A complete replacement of the old value
D. A sum of all past rewards
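
The incremental form Q + α·(target − Q) rearranges algebraically to (1 − α)·Q + α·target, i.e. a weighted average of the old estimate and the new information. A quick numeric check in Python (the values are illustrative):

q_old, target, alpha = 2.0, 6.0, 0.25
incremental = q_old + alpha * (target - q_old)    # 2 + 0.25*4 = 3.0
weighted = (1 - alpha) * q_old + alpha * target   # 0.75*2 + 0.25*6 = 3.0
assert incremental == weighted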

49 What happens if the exploration rate ε in ε-greedy is set to 1?

A. The agent acts completely randomly
B. The agent alternates actions
C. The agent acts purely greedily
D. The agent stops learning

50 Generally, how does TD learning compare to Monte Carlo in terms of variance?

A. They have the same variance
B. TD has lower variance
C. Variance is not a factor in RL
D. TD has higher variance