Unit 4 - Practice Quiz

INT423

1 What is the primary goal of an agent in Reinforcement Learning?

A. To classify data into distinct categories
B. To maximize the cumulative reward over time
C. To minimize the error in prediction
D. To cluster similar data points together

2 Which of the following elements is NOT a core component of a Reinforcement Learning system?

A. Agent
B. Environment
C. Supervisor Labels
D. Reward Signal

3 In the context of RL, what does the 'Markov Property' imply about the state?

A. The future depends on the past history of all states.
B. The future depends only on the current state and action, not the history.
C. The state is independent of the actions taken.
D. The state transition is always deterministic.

4 What does a 'Policy' represent in Reinforcement Learning?

A. The probability of moving from one state to another
B. The immediate reward received after an action
C. A mapping from perceived states to actions to be taken
D. The calculation of total future reward

5 In an MDP, what does the discount factor (gamma, γ) determine?

A. The probability of choosing a random action
B. The importance of future rewards relative to immediate rewards
C. The learning rate of the algorithm
D. The magnitude of the transition probability

6 Which tuple represents a finite Markov Decision Process (MDP)?

A. (S, A, P, R)
B. (S, A, P, R, γ)
C. (S, P, R, γ)
D. (S, A, R, γ)
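
For reference, a minimal sketch of how the five components asked about above might be held together in code; the dataclass name, the two-state example, and all values are illustrative assumptions, not part of the quiz.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

State = str
Action = str

@dataclass
class FiniteMDP:
    """The (S, A, P, R, gamma) tuple of a finite MDP."""
    states: List[State]                                    # S
    actions: List[Action]                                  # A
    transitions: Dict[Tuple[State, Action, State], float]  # P(s' | s, a)
    rewards: Dict[Tuple[State, Action, State], float]      # R(s, a, s')
    gamma: float                                           # discount factor

# A toy two-state example (hypothetical values, for illustration only)
mdp = FiniteMDP(
    states=["s0", "s1"],
    actions=["stay", "go"],
    transitions={("s0", "go", "s1"): 1.0, ("s0", "stay", "s0"): 1.0,
                 ("s1", "go", "s0"): 1.0, ("s1", "stay", "s1"): 1.0},
    rewards={("s0", "go", "s1"): 1.0, ("s0", "stay", "s0"): 0.0,
             ("s1", "go", "s0"): 0.0, ("s1", "stay", "s1"): 0.0},
    gamma=0.9,
)
```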

7 What is the difference between a Value Function V(s) and an Action-Value Function Q(s, a)?

A. V(s) includes the action taken, while Q(s, a) does not.
B. V(s) estimates the return of a state, while Q(s, a) estimates the return of taking an action in a state.
C. V(s) is for continuous spaces, Q(s, a) is for discrete spaces.
D. There is no mathematical difference.

8 The Bellman Equation expresses the relationship between:

A. The value of a state and the values of its successor states
B. The agent and the environment
C. The exploration rate and the exploitation rate
D. The current reward and the previous reward

9 Which method requires the completion of an entire episode before updating the value estimates?

A. Temporal Difference Learning
B. Dynamic Programming
C. Monte Carlo Learning
D. Q-Learning

10 What is 'Bootstrapping' in the context of Temporal Difference (TD) learning?

A. Restarting the learning process from scratch
B. Updating an estimate based on another estimate
C. Using random weights for initialization
D. Running multiple episodes in parallel

11 In the Exploration vs. Exploitation trade-off, what does 'Exploitation' refer to?

A. Trying new actions to find better rewards
B. Choosing the action currently believed to be the best
C. Randomly selecting actions
D. Ignoring the reward signal

12 What is the Epsilon-Greedy strategy?

A. Always choosing the best action
B. Always choosing a random action
C. Choosing the best action most of the time, but a random action with probability epsilon
D. Choosing the action with the lowest value
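
For reference, a minimal sketch of the epsilon-greedy rule described above, assuming a table Q of action-value estimates; the function name and the example values are illustrative.

```python
import random

def epsilon_greedy(Q, actions, epsilon):
    """With probability epsilon take a random action (explore);
    otherwise take the action with the highest estimated value (exploit)."""
    if random.random() < epsilon:
        return random.choice(actions)        # explore
    return max(actions, key=lambda a: Q[a])  # exploit (greedy)

# Example usage with hypothetical value estimates
Q = {"left": 0.2, "right": 0.5}
action = epsilon_greedy(Q, ["left", "right"], epsilon=0.1)
```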

13 Which equation represents the Bellman Optimality Equation for V*(s)?

A. V*(s) = max_a Σ_{s'} P(s'|s,a) [R + γV*(s')]
B. V(s) = Σ_{s'} P(s'|s,a) [R + γV(s')]
C. V(s) = R + γV(s')
D. V*(s) = max_a R
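
For reference, the Bellman optimality equation for the optimal state-value function, written out in the general four-argument form that also appears in question 21 (the sum runs over successor states s' and rewards r):

```latex
V^{*}(s) = \max_{a} \sum_{s', r} p(s', r \mid s, a) \bigl[ r + \gamma V^{*}(s') \bigr]
```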

14 In Monte Carlo learning, what is the difference between 'First-visit' and 'Every-visit' MC?

A. First-visit is faster; Every-visit is slower.
B. First-visit updates only the first time a state is visited in an episode; Every-visit updates for all visits.
C. First-visit uses bootstrapping; Every-visit does not.
D. Every-visit is for continuous tasks; First-visit is for episodic tasks.

15 What is the TD(0) update rule for V(s)?

A. V(s) ← V(s) + α [G_t - V(s)]
B. V(s) ← V(s) + α [R + γV(s') - V(s)]
C. V(s) ← max_a Q(s, a)
D. V(s) ← R + γV(s')
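
For reference, a minimal sketch of a single TD(0) step, assuming a value table V keyed by state and a step size alpha; all names and values are illustrative.

```python
def td0_update(V, s, r, s_next, alpha, gamma):
    """Move V(s) toward the TD target r + gamma * V(s')."""
    td_target = r + gamma * V[s_next]
    td_error = td_target - V[s]     # the delta of the TD-error equation
    V[s] += alpha * td_error
    return V

# Example usage with hypothetical values
V = {"s0": 0.0, "s1": 1.0}
td0_update(V, s="s0", r=0.5, s_next="s1", alpha=0.1, gamma=0.9)
```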

16 Which of the following describes a 'Model-Free' RL approach?

A. The agent learns the transition probabilities and reward function explicitly.
B. The agent plans by simulating future states.
C. The agent learns a policy or value function directly from experience without knowing the environment's dynamics.
D. The agent requires a supervisor to model the environment.

17 What is the return (Gt) in Reinforcement Learning?

A. The immediate reward received
B. The total discounted sum of future rewards
C. The average reward of the episode
D. The final reward at the terminal state

18 If the discount factor γ is 0, the agent is:

A. Myopic (cares only about immediate reward)
B. Far-sighted (cares only about long-term reward)
C. Random
D. Optimal

19 What is an 'Episodic Task'?

A. A task that continues indefinitely without end
B. A task that breaks interaction into subsequences called episodes which end in a terminal state
C. A task where the reward is always zero
D. A task with only one state

20 Comparing MC and TD methods, which statement is true regarding variance and bias?

A. MC has low variance, high bias.
B. TD has high variance, low bias.
C. MC has high variance, zero bias; TD has low variance, some bias.
D. MC and TD have identical variance and bias properties.

21 In the context of the Bellman Equation, what is p(s', r | s, a)?

A. The policy function
B. The value function
C. The dynamics function (probability of next state and reward)
D. The discount factor

22 What is a 'Deterministic Policy'?

A. A policy that maps a state to a probability distribution over actions
B. A policy that maps a state to a specific, single action
C. A policy that changes over time
D. A policy that ignores the state

23 Which learning method performs updates step-by-step without waiting for the episode to end?

A. Monte Carlo
B. Temporal Difference
C. Exhaustive Search
D. Batch Learning

24 The term 'Greedy Action' implies:

A. Selecting the action with the highest estimated value
B. Selecting the action with the lowest cost
C. Selecting a random action
D. Selecting an action that maximizes exploration

25 What is the role of the Value Function?

A. To define the rules of the environment
B. To predict how good it is to be in a specific state
C. To generate random numbers
D. To store the immediate reward

26 Which of the following is NOT a challenge in Reinforcement Learning?

A. Exploration vs Exploitation
B. Delayed Reward
C. Credit Assignment Problem
D. Availability of labeled training data

27 In the equation G_t = R_{t+1} + γR_{t+2} + γ^2R_{t+3} + ..., what is G_t?

A. The value function
B. The discounted return
C. The policy
D. The transition probability
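
For reference, a minimal sketch of how the discounted return in the equation above is computed from a finite list of rewards R_{t+1}, R_{t+2}, ...; the reward values and gamma are illustrative.

```python
def discounted_return(rewards, gamma):
    """G_t = R_{t+1} + gamma * R_{t+2} + gamma^2 * R_{t+3} + ..."""
    return sum((gamma ** k) * r for k, r in enumerate(rewards))

# Example: with gamma = 0.5, rewards [1, 1, 1] give 1 + 0.5 + 0.25 = 1.75
print(discounted_return([1.0, 1.0, 1.0], gamma=0.5))
```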

28 Why is exploration necessary in Reinforcement Learning?

A. To avoid overfitting
B. To discover states and actions that might yield higher rewards than the current best known options
C. To speed up the calculation of the Bellman equation
D. To minimize the discount factor

29 Which of the following is an Off-Policy control method?

A. SARSA
B. Q-Learning
C. Monte Carlo Policy Evaluation
D. Standard TD Prediction

30 What is the 'Credit Assignment Problem' in RL?

A. Determining which past action is responsible for a current reward
B. Assigning monetary value to states
C. Calculating the computational cost of the algorithm
D. Deciding how much memory to allocate

31 An optimal policy π* is defined as:

A. A policy that is better than or equal to all other policies
B. A policy that reaches the terminal state fastest
C. A policy that explores every state
D. A policy with zero discount factor

32 In Q-Learning, the target value for the update is:

A. R + γ Q(s', a')
B. R + γ max_a' Q(s', a')
C. The actual return Gt
D. V(s')
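
For reference, a minimal sketch of the Q-learning update, assuming a nested table Q[s][a] of action-value estimates; the max over next actions in the target is what makes the method off-policy. Names and values are illustrative.

```python
def q_learning_update(Q, s, a, r, s_next, alpha, gamma):
    """Off-policy update: the target maximizes over a' regardless of
    which action the behavior policy actually takes next."""
    target = r + gamma * max(Q[s_next].values())
    Q[s][a] += alpha * (target - Q[s][a])
    return Q

# Example usage with hypothetical values
Q = {"s0": {"left": 0.0, "right": 0.0}, "s1": {"left": 1.0, "right": 0.0}}
q_learning_update(Q, "s0", "right", r=0.0, s_next="s1", alpha=0.1, gamma=0.9)
```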

33 Monte Carlo methods are applicable only to:

A. Continuing tasks
B. Episodic tasks
C. Tasks with known models
D. Deterministic environments

34 What does SARSA stand for?

A. State-Action-Reward-State-Action
B. State-Action-Return-State-Average
C. System-Action-Reward-System-Action
D. Search-And-Retrieve-Sorted-Arrays

35 Under what condition does the 'Optimistic Initial Values' technique encourage exploration?

A. When initial value estimates are set very low
B. When initial value estimates are set higher than the expected maximum reward
C. When epsilon is set to 0
D. When the discount factor is 1

36 Which Bellman equation is linear?

A. Bellman Expectation Equation
B. Bellman Optimality Equation
C. Both
D. Neither

37 What is the main advantage of TD learning over Monte Carlo?

A. It is unbiased
B. It can learn online during an episode
C. It works better for non-Markov environments
D. It requires less memory

38 The sequence of states, actions, and rewards S_0, A_0, R_1, S_1, A_1, R_2, ... is called:

A. A Policy
B. A Trajectory
C. A Model
D. A Value Function

39 In a stochastic environment:

A. Taking an action always leads to the same next state
B. Taking an action leads to a next state based on a probability distribution
C. Rewards are not provided
D. The agent cannot learn

40 Which algorithm is considered 'On-Policy'?

A. Q-Learning
B. SARSA
C. Max-Q
D. Off-Policy MC
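
For contrast with the Q-learning sketch above, a minimal sketch of the SARSA update: the target uses the action a' that the current policy actually selects next, which is what makes it on-policy. Names are illustrative.

```python
def sarsa_update(Q, s, a, r, s_next, a_next, alpha, gamma):
    """On-policy update: the target uses Q(s', a') for the action a'
    actually chosen by the behavior policy in the next state."""
    target = r + gamma * Q[s_next][a_next]
    Q[s][a] += alpha * (target - Q[s][a])
    return Q
```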

41 The quantity R + γV(s') is often called the:

A. TD Error
B. TD Target
C. Monte Carlo Return
D. Exploration Bonus

42 If an agent uses a pure Greedy strategy (epsilon=0), it:

A. Never explores
B. Explores randomly
C. Explores 50% of the time
D. Alternates between exploration and exploitation

43 The State-Value function V_π(s) is the expected return starting from state s and then following:

A. The optimal policy
B. Policy π
C. A random policy
D. The greedy policy

44 Dynamic Programming (DP) methods in RL assume:

A. A perfect model of the environment is available
B. The environment is unknown
C. Monte Carlo sampling is used
D. Rewards are always positive

45 What is 'Policy Improvement'?

A. Calculating the value function for a policy
B. Making a new policy that is greedy with respect to the current value function
C. Increasing the learning rate
D. Collecting more data

46 Upper Confidence Bound (UCB) is an algorithm used to handle:

A. The Exploration-Exploitation Dilemma
B. The Bellman Equation
C. Discount Factors
D. Continuous State Spaces
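
For reference, a minimal sketch of UCB1-style action selection, assuming value estimates Q[a], visit counts N[a], a total step count t, and an exploration constant c; all of these names and values are illustrative.

```python
import math

def ucb_select(Q, N, t, c=2.0):
    """Pick the action maximizing Q(a) + c * sqrt(ln(t) / N(a));
    untried actions (N(a) == 0) are selected first."""
    for a, n in N.items():
        if n == 0:
            return a
    return max(Q, key=lambda a: Q[a] + c * math.sqrt(math.log(t) / N[a]))

# Example usage with hypothetical estimates and counts
Q = {"left": 0.4, "right": 0.6}
N = {"left": 10, "right": 5}
action = ucb_select(Q, N, t=15)
```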

47 In the TD error equation δ = R + γV(s') - V(s), what does δ represent?

A. The difference between the target and the current estimate
B. The total return
C. The probability of the next state
D. The learning rate

48 A key distinction between Reinforcement Learning and Unsupervised Learning is:

A. RL uses labeled data
B. RL maximizes a reward signal, Unsupervised Learning finds hidden structure
C. RL is for clustering
D. Unsupervised Learning uses a supervisor

49 Which of the following is an example of a 'Continuing Task'?

A. Chess
B. Go
C. A robot balancing only for 10 seconds
D. An automated stock trading agent operating indefinitely

50 Policy Iteration consists of two alternating steps:

A. Policy Evaluation and Policy Improvement
B. Exploration and Exploitation
C. Monte Carlo and TD
D. Prediction and Control
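
For reference, a minimal sketch of the two alternating steps of policy iteration, assuming a perfect model given as complete tables P[(s, a, s')] and R[(s, a, s')] (the dynamic-programming setting); the function and variable names are illustrative.

```python
def policy_evaluation(policy, states, P, R, gamma, theta=1e-6):
    """Step 1: iteratively compute V_pi for the current deterministic policy."""
    V = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            a = policy[s]
            v_new = sum(P[(s, a, s2)] * (R[(s, a, s2)] + gamma * V[s2]) for s2 in states)
            delta = max(delta, abs(v_new - V[s]))
            V[s] = v_new
        if delta < theta:
            return V

def policy_improvement(V, states, actions, P, R, gamma):
    """Step 2: make the policy greedy with respect to the current value function."""
    return {s: max(actions, key=lambda a: sum(P[(s, a, s2)] * (R[(s, a, s2)] + gamma * V[s2])
                                              for s2 in states))
            for s in states}

# Alternate the two steps until the policy stops changing:
# while True:
#     V = policy_evaluation(policy, states, P, R, gamma)
#     new_policy = policy_improvement(V, states, actions, P, R, gamma)
#     if new_policy == policy:
#         break
#     policy = new_policy
```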