Unit 4 - Practice Quiz

INT423 50 Questions

1 What is the primary goal of an agent in Reinforcement Learning?

A. To minimize the error in prediction
B. To cluster similar data points together
C. To maximize the cumulative reward over time
D. To classify data into distinct categories

2 Which of the following elements is NOT a core component of a Reinforcement Learning system?

A. Agent
B. Reward Signal
C. Supervisor Labels
D. Environment

3 In the context of RL, what does the 'Markov Property' imply about the state?

A. The future depends on the past history of all states.
B. The state transition is always deterministic.
C. The future depends only on the current state and action, not the history.
D. The state is independent of the actions taken.

4 What does a 'Policy' represent in Reinforcement Learning?

A. The immediate reward received after an action
B. The calculation of total future reward
C. The probability of moving from one state to another
D. A mapping from perceived states to actions to be taken

5 In an MDP, what does the discount factor (gamma, γ) determine?

A. The magnitude of the transition probability
B. The importance of future rewards relative to immediate rewards
C. The learning rate of the algorithm
D. The probability of choosing a random action

6 Which tuple represents a finite Markov Decision Process (MDP)?

A. (S, A, R, γ)
B. (S, A, P, R)
C. (S, A, P, R, γ)
D. (S, P, R, γ)

7 What is the difference between a Value Function V(s) and an Action-Value Function Q(s, a)?

A. There is no mathematical difference.
B. V(s) is for continuous spaces, Q(s, a) is for discrete spaces.
C. V(s) estimates the return of a state, while Q(s, a) estimates the return of taking an action in a state.
D. V(s) includes the action taken, while Q(s, a) does not.
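
Study note: the two functions are linked by a standard identity; V_π is the policy-weighted average of Q_π over actions. Written out in LaTeX for clarity:

    V_\pi(s) = \mathbb{E}_\pi\left[ G_t \mid S_t = s \right] = \sum_{a} \pi(a \mid s)\, Q_\pi(s, a)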

8 The Bellman Equation expresses the relationship between:

A. The agent and the environment
B. The exploration rate and the exploitation rate
C. The current reward and the previous reward
D. The value of a state and the values of its successor states

9 Which method requires the completion of an entire episode before updating the value estimates?

A. Monte Carlo Learning
B. Temporal Difference Learning
C. Q-Learning
D. Dynamic Programming

10 What is 'Bootstrapping' in the context of Temporal Difference (TD) learning?

A. Running multiple episodes in parallel
B. Updating an estimate based on another estimate
C. Using random weights for initialization
D. Restarting the learning process from scratch

11 In the Exploration vs. Exploitation trade-off, what does 'Exploitation' refer to?

A. Trying new actions to find better rewards
B. Ignoring the reward signal
C. Choosing the action currently believed to be the best
D. Randomly selecting actions

12 What is the Epsilon-Greedy strategy?

A. Always choosing a random action
B. Always choosing the best action
C. Choosing the action with the lowest value
D. Choosing the best action most of the time, but a random action with probability epsilon
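
Study note: a minimal Python sketch of the epsilon-greedy rule (the array name q_values and the tie-breaking-by-index argmax are illustrative choices, not from the course material):

    import random

    def epsilon_greedy(q_values, epsilon):
        """Pick a random action with probability epsilon, else the greedy one."""
        if random.random() < epsilon:
            # Explore: uniformly random action
            return random.randrange(len(q_values))
        # Exploit: action with the highest current estimate
        return max(range(len(q_values)), key=lambda a: q_values[a])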

13 Which equation represents the Bellman Optimality Equation for V*(s)?

A. V*(s) = Σ_{s'} P(s'|s,a) [R + γV*(s')]
B. V*(s) = max_a Σ_{s'} P(s'|s,a) [R + γV*(s')]
C. V*(s) = max_a R
D. V*(s) = R + γV*(s')
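
Study note: for comparison with the options above, the optimality backup in standard notation (same symbols as the options; the reward is kept abstract as R):

    V^*(s) = \max_{a} \sum_{s'} P(s' \mid s, a)\left[ R + \gamma V^*(s') \right]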

14 In Monte Carlo learning, what is the difference between 'First-visit' and 'Every-visit' MC?

A. First-visit is faster; Every-visit is slower.
B. First-visit updates only the first time a state is visited in an episode; Every-visit updates for all visits.
C. Every-visit is for continuous tasks; First-visit is for episodic tasks.
D. First-visit uses bootstrapping; Every-visit does not.
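
Study note: a short Python sketch of the first-visit bookkeeping, assuming an episode stored as (state, reward) pairs and an incremental-mean update (both are illustrative choices):

    def first_visit_mc(V, N, episode, gamma):
        """Update V once per state, at its first occurrence in the episode."""
        G = 0.0
        returns = []
        for s, r in reversed(episode):      # accumulate G_t backwards
            G = r + gamma * G
            returns.append((s, G))
        seen = set()
        for s, G in reversed(returns):      # forward order: first visits first
            if s in seen:
                continue                    # every-visit MC would update here too
            seen.add(s)
            N[s] += 1
            V[s] += (G - V[s]) / N[s]       # incremental mean of observed returns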

15 What is the TD(0) update rule for V(s)?

A. V(s) ← V(s) + α [G_t - V(s)]
B. V(s) ← max(Q(s, a))
C. V(s) ← V(s) + α [R + γV(s') - V(s)]
D. V(s) ← R + γV(s')
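
Study note: a minimal Python sketch of this update (value table stored as a dict; the names V, alpha, gamma are illustrative):

    def td0_update(V, s, r, s_next, alpha, gamma):
        """One TD(0) step: move V(s) toward the TD target r + gamma * V(s')."""
        td_target = r + gamma * V[s_next]
        td_error = td_target - V[s]
        V[s] += alpha * td_error

The intermediate quantities here are exactly the "TD target" and "TD error" asked about in later questions.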

16 Which of the following describes a 'Model-Free' RL approach?

A. The agent plans by simulating future states.
B. The agent learns a policy or value function directly from experience without knowing the environment's dynamics.
C. The agent requires a supervisor to model the environment.
D. The agent learns the transition probabilities and reward function explicitly.

17 What is the return (Gt) in Reinforcement Learning?

A. The total discounted sum of future rewards
B. The immediate reward received
C. The final reward at the terminal state
D. The average reward of the episode

18 If the discount factor γ is 0, the agent is:

A. Myopic (cares only about immediate reward)
B. Far-sighted (cares only about long-term reward)
C. Random
D. Optimal

19 What is an 'Episodic Task'?

A. A task that breaks the interaction into subsequences called episodes, each ending in a terminal state
B. A task with only one state
C. A task that continues indefinitely without end
D. A task where the reward is always zero

20 Comparing MC and TD methods, which statement is true regarding variance and bias?

A. MC has low variance, high bias.
B. MC has high variance, zero bias; TD has low variance, some bias.
C. TD has high variance, low bias.
D. MC and TD have identical variance and bias properties.

21 In the context of the Bellman Equation, what is p(s', r | s, a)?

A. The discount factor
B. The value function
C. The dynamics function (probability of next state and reward)
D. The policy function

22 What is a 'Deterministic Policy'?

A. A policy that maps a state to a specific, single action
B. A policy that maps a state to a probability distribution over actions
C. A policy that changes over time
D. A policy that ignores the state

23 Which learning method performs updates step-by-step without waiting for the episode to end?

A. Batch Learning
B. Exhaustive Search
C. Monte Carlo
D. Temporal Difference

24 The term 'Greedy Action' implies:

A. Selecting the action with the highest estimated value
B. Selecting the action with the lowest cost
C. Selecting an action that maximizes exploration
D. Selecting a random action

25 What is the role of the Value Function?

A. To define the rules of the environment
B. To predict how good it is to be in a specific state
C. To generate random numbers
D. To store the immediate reward

26 Which of the following is NOT a challenge in Reinforcement Learning?

A. Delayed Reward
B. Credit Assignment Problem
C. Availability of labeled training data
D. Exploration vs Exploitation

27 In the equation G_t = R_{t+1} + γR_{t+2} + γ^2R_{t+3} + ..., what is G_t?

A. The transition probability
B. The value function
C. The policy
D. The discounted return
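
Study note: a quick worked example with made-up numbers (γ = 0.9 and three remaining rewards):

    # G_t = 1 + 0.9*2 + 0.9^2*3 = 1 + 1.8 + 2.43 = 5.23
    rewards = [1, 2, 3]
    gamma = 0.9
    G = sum(gamma**k * r for k, r in enumerate(rewards))
    print(G)  # ~5.23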

28 Why is exploration necessary in Reinforcement Learning?

A. To avoid overfitting
B. To speed up the calculation of the Bellman equation
C. To discover states and actions that might yield higher rewards than the current best known options
D. To minimize the discount factor

29 Which of the following is an Off-Policy control method?

A. SARSA
B. Monte Carlo Policy Evaluation
C. Standard TD Prediction
D. Q-Learning

30 What is the 'Credit Assignment Problem' in RL?

A. Assigning monetary value to states
B. Deciding how much memory to allocate
C. Determining which past action is responsible for a current reward
D. Calculating the computational cost of the algorithm

31 An optimal policy π* is defined as:

A. A policy with zero discount factor
B. A policy that explores every state
C. A policy that is better than or equal to all other policies
D. A policy that reaches the terminal state fastest

32 In Q-Learning, the target value for the update is:

A. R + γ Q(s', a')
B. R + γ max_a' Q(s', a')
C. V(s')
D. The actual return Gt
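
Study note: a minimal Python sketch of this update (Q stored as a dict of dicts; terminal-state handling omitted for brevity):

    def q_learning_update(Q, s, a, r, s_next, alpha, gamma):
        """Off-policy TD control: the target bootstraps from the best next action."""
        best_next = max(Q[s_next].values())  # max_a' Q(s', a')
        target = r + gamma * best_next
        Q[s][a] += alpha * (target - Q[s][a])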

33 Monte Carlo methods are applicable only to:

A. Tasks with known models
B. Deterministic environments
C. Continuous tasks
D. Episodic tasks

34 What does SARSA stand for?

A. System-Action-Reward-System-Action
B. State-Action-Return-State-Average
C. State-Action-Reward-State-Action
D. Search-And-Retrieve-Sorted-Arrays
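
Study note: for contrast with the Q-learning sketch above, SARSA's target uses the action the policy actually selected next, which is what makes it on-policy (same illustrative data structure):

    def sarsa_update(Q, s, a, r, s_next, a_next, alpha, gamma):
        """On-policy TD control: the target uses the next action actually taken."""
        target = r + gamma * Q[s_next][a_next]
        Q[s][a] += alpha * (target - Q[s][a])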

35 When does the 'Optimistic Initial Values' technique encourage exploration?

A. When initial value estimates are set very low
B. When initial value estimates are set higher than the expected maximum reward
C. When the discount factor is 1
D. When epsilon is set to 0

36 Which Bellman equation is linear?

A. Bellman Expectation Equation
B. Bellman Optimality Equation
C. Both
D. Neither

37 What is the main advantage of TD learning over Monte Carlo?

A. It works better for non-Markov environments
B. It can learn online during an episode
C. It is unbiased
D. It requires less memory

38 The sequence of states, actions, and rewards S_0, A_0, R_1, S_1, A_1, R_2, ... is called:

A. A Value Function
B. A Policy
C. A Model
D. A Trajectory

39 In a stochastic environment:

A. Taking an action leads to a next state based on a probability distribution
B. Taking an action always leads to the same next state
C. The agent cannot learn
D. Rewards are not provided

40 Which algorithm is considered 'On-Policy'?

A. Q-Learning
B. Max-Q
C. Off-Policy MC
D. SARSA

41 The quantity R + γV(s') is often called the:

A. Monte Carlo Return
B. TD Target
C. TD Error
D. Exploration Bonus

42 If an agent uses a pure Greedy strategy (epsilon=0), it:

A. Alternates between exploration and exploitation
B. Explores 50% of the time
C. Explores randomly
D. Never explores

43 The State-Value function V_π(s) is the expected return starting from state s and then following:

A. The greedy policy
B. A random policy
C. Policy π
D. The optimal policy

44 Dynamic Programming (DP) methods in RL assume:

A. Monte Carlo sampling is used
B. The environment is unknown
C. A perfect model of the environment is available
D. Rewards are always positive

45 What is 'Policy Improvement'?

A. Calculating the value function for a policy
B. Making a new policy that is greedy with respect to the current value function
C. Increasing the learning rate
D. Collecting more data

46 Upper Confidence Bound (UCB) is an algorithm used to handle:

A. Discount Factors
B. Continuous State Spaces
C. The Bellman Equation
D. The Exploration-Exploitation Dilemma
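
Study note: a minimal UCB1-style selection sketch for a bandit setting (the exploration constant c and the counts bookkeeping are illustrative assumptions; t is the 1-based timestep):

    import math

    def ucb_select(q_values, counts, t, c=2.0):
        """Pick the action maximizing estimate + exploration bonus."""
        for a, n in enumerate(counts):
            if n == 0:
                return a  # try every action at least once
        return max(
            range(len(q_values)),
            key=lambda a: q_values[a] + c * math.sqrt(math.log(t) / counts[a]),
        )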

47 In the TD error equation δ = R + γV(s') - V(s), what does δ represent?

A. The total return
B. The difference between the target and the current estimate
C. The learning rate
D. The probability of the next state

48 A key distinction between Reinforcement Learning and Unsupervised Learning is:

A. RL maximizes a reward signal, Unsupervised Learning finds hidden structure
B. RL is for clustering
C. Unsupervised Learning uses a supervisor
D. RL uses labeled data

49 Which of the following is an example of a 'Continuing Task'?

A. Go
B. An automated stock trading agent operating indefinitely
C. A robot balancing only for 10 seconds
D. Chess

50 Policy Iteration consists of two alternating steps:

A. Exploration and Exploitation
B. Policy Evaluation and Policy Improvement
C. Prediction and Control
D. Monte Carlo and TD
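
Study note: a compact sketch of the two alternating steps, assuming a known model P[s][a] given as a list of (probability, next_state, reward) triples (a hypothetical structure, consistent with the DP assumption in Q44):

    import random

    def policy_iteration(states, actions, P, gamma, theta=1e-6):
        """Alternate policy evaluation and greedy policy improvement."""
        policy = {s: random.choice(actions) for s in states}
        V = {s: 0.0 for s in states}
        while True:
            # 1. Policy Evaluation: sweep the Bellman expectation backup to convergence
            while True:
                delta = 0.0
                for s in states:
                    v = sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][policy[s]])
                    delta = max(delta, abs(v - V[s]))
                    V[s] = v
                if delta < theta:
                    break
            # 2. Policy Improvement: make the policy greedy w.r.t. the current V
            stable = True
            for s in states:
                best = max(actions, key=lambda a: sum(
                    p * (r + gamma * V[s2]) for p, s2, r in P[s][a]))
                if best != policy[s]:
                    policy[s] = best
                    stable = False
            if stable:
                return policy, V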