Unit 5 - Practice Quiz

INT423

1 What type of Reinforcement Learning algorithm is Q-Learning?

A. Model-based, On-policy
B. Model-free, On-policy
C. Model-based, Off-policy
D. Model-free, Off-policy

2 In Q-Learning, what does the 'Q' specifically represent?

A. Quantity
B. Quality
C. Query
D. Queue

3 What is the primary data structure used in basic tabular Q-Learning?

A. A Neural Network
B. A Q-Table
C. A Decision Tree
D. A Graph
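
For reference, a minimal sketch of the tabular case, assuming a small discrete environment (the sizes below are illustrative only):

    import numpy as np

    n_states, n_actions = 16, 4                 # assumed sizes, e.g. a small grid world
    q_table = np.zeros((n_states, n_actions))   # one row per state, one column per action

    state = 3
    greedy_action = int(np.argmax(q_table[state]))   # look up the best known action in that state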

4 Which equation is used to update the Q-values in Q-Learning?

A. Maxwell's Equation
B. Bellman Equation
C. Schrödinger Equation
D. Euler's Equation

5 In the Q-learning update rule, what is the role of the learning rate (alpha)?

A. It determines the importance of future rewards.
B. It determines the probability of exploring a random action.
C. It controls how much the new information overrides the old information.
D. It calculates the total cumulative reward.

6 What is the purpose of the discount factor (gamma) in Q-Learning?

A. To balance immediate and future rewards
B. To set the learning speed
C. To determine the exploration rate
D. To initialize the Q-table

7 If the discount factor (gamma) is set to 0, what will the agent optimize for?

A. Long-term cumulative reward
B. Only the immediate reward
C. The average reward over time
D. Random rewards

8 What is the 'Temporal Difference (TD) Error' in the context of Q-Learning?

A. The difference between the current Q-value and the previous Q-value
B. The difference between the target Q-value and the current predicted Q-value
C. The error in the reward function
D. The time it takes to converge
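
For reference, a minimal sketch of the tabular Q-Learning update, assuming a NumPy q_table like the one above (alpha and gamma values are illustrative). Alpha scales how much the TD error overrides the old estimate, gamma discounts the bootstrapped future value, and with gamma = 0 the target collapses to the immediate reward:

    import numpy as np

    alpha, gamma = 0.1, 0.99                    # assumed learning rate and discount factor

    def q_update(q_table, s, a, r, s_next):
        # Bellman-style target: immediate reward + discounted value of the best next action
        target = r + gamma * np.max(q_table[s_next])
        td_error = target - q_table[s, a]       # Temporal Difference (TD) error
        q_table[s, a] += alpha * td_error       # alpha controls how far the estimate moves
        return td_error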

9 What is the Epsilon-Greedy strategy used for?

A. To calculate the loss function
B. To balance exploration and exploitation
C. To update the weights of the network
D. To store experiences in replay memory

10 In an Epsilon-Greedy strategy, what happens if epsilon is 1?

A. The agent always chooses the action with the highest Q-value.
B. The agent always chooses a random action.
C. The agent stops learning.
D. The agent alternates between best and random actions.

11 What is the typical behavior of epsilon during the training process in Deep Q-Learning?

A. It starts low and increases over time.
B. It remains constant throughout training.
C. It starts high and decays over time.
D. It fluctuates randomly.
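
For reference, a minimal epsilon-greedy sketch with a decaying epsilon (the decay schedule below is an assumption, not fixed by the quiz). With epsilon = 1 the random branch always fires; as epsilon decays, the agent exploits more:

    import random
    import numpy as np

    epsilon, epsilon_min, epsilon_decay = 1.0, 0.05, 0.995   # assumed schedule

    def select_action(q_values, epsilon):
        if random.random() < epsilon:                 # explore: pick a random action
            return random.randrange(len(q_values))
        return int(np.argmax(q_values))               # exploit: pick the best known action

    # after each episode, decay epsilon toward its minimum
    epsilon = max(epsilon_min, epsilon * epsilon_decay)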

12 Why does tabular Q-Learning fail in domains like Atari games or robotics?

A. The math does not apply to games.
B. The rewards are not defined.
C. The state space is too large (Curse of Dimensionality).
D. Q-learning cannot handle discrete actions.

13 What replaces the Q-Table in a Deep Q-Network (DQN)?

A. A larger Q-Table
B. A Deep Neural Network
C. A Genetic Algorithm
D. A Linear Regression model

14 What is the input to the neural network in a standard DQN for playing video games?

A. The current score
B. The raw pixels of the game screen (state)
C. The action to be taken
D. The Q-value

15 What is the output layer size of a DQN used for an environment with 'N' discrete actions?

A. 1 (The best action)
B. 1 (The value of the state)
C. N (One Q-value for each action)
D. N x N
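
For reference, a minimal DQN sketched in PyTorch, assuming a stack of 4 grayscale 84x84 frames as input and N discrete actions as output; the layer sizes follow the widely used DeepMind Atari configuration but are assumptions here:

    import torch.nn as nn

    class DQN(nn.Module):
        def __init__(self, n_actions):
            super().__init__()
            self.net = nn.Sequential(
                nn.Conv2d(4, 32, kernel_size=8, stride=4), nn.ReLU(),   # 4 stacked frames in
                nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),
                nn.Conv2d(64, 64, kernel_size=3, stride=1), nn.ReLU(),
                nn.Flatten(),
                nn.Linear(64 * 7 * 7, 512), nn.ReLU(),
                nn.Linear(512, n_actions),   # one Q-value per action (regression head, no softmax)
            )

        def forward(self, x):
            return self.net(x)               # shape: (batch, n_actions)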

16 What is 'Experience Replay' in DQN?

A. Replaying the game after winning.
B. Storing past transitions (s, a, r, s') in a buffer and sampling minibatches for training.
C. Running the same episode multiple times.
D. Using the target network to replay actions.

17 What is the primary benefit of using Experience Replay?

A. It increases the epsilon value.
B. It breaks the correlation between consecutive samples and stabilizes training.
C. It removes the need for a target network.
D. It guarantees finding the global minimum.
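
For reference, a minimal replay buffer sketch, assuming transitions stored as (s, a, r, s', done) tuples; the capacity is an arbitrary assumption:

    import random
    from collections import deque

    class ReplayBuffer:
        def __init__(self, capacity=100_000):
            self.buffer = deque(maxlen=capacity)     # oldest transitions drop off automatically

        def push(self, s, a, r, s_next, done):
            self.buffer.append((s, a, r, s_next, done))

        def sample(self, batch_size):
            # uniform random minibatches break the correlation between consecutive steps
            return random.sample(self.buffer, batch_size)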

18 In the context of DQN, what is the 'Target Network'?

A. The network that selects the action.
B. A copy of the main network with frozen weights used to calculate target Q-values.
C. The network that predicts the reward.
D. The network used during the testing phase only.

19 Why is a Target Network necessary in DQN?

A. To prevent the 'chasing your own tail' instability where target values shift constantly.
B. To speed up the backpropagation process.
C. To increase the exploration rate.
D. To handle continuous action spaces.

20 How are the weights of the Target Network usually updated?

A. Continuous backpropagation along with the main network.
B. Copied from the main network every fixed number of steps.
C. Randomly initialized every step.
D. Updated using a different loss function.
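
For reference, a minimal sketch of the periodic hard update, assuming PyTorch modules main_net and target_net with identical architecture; the sync interval is an assumed hyperparameter:

    SYNC_EVERY = 1000   # assumed interval in environment steps

    def maybe_sync(step, main_net, target_net):
        # the target network is not trained by backprop; it is simply overwritten periodically
        if step % SYNC_EVERY == 0:
            target_net.load_state_dict(main_net.state_dict())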

21 What is the loss function typically used in DQN?

A. Cross-Entropy Loss
B. Mean Squared Error (MSE) between predicted Q and target Q
C. Hinge Loss
D. Kullback-Leibler Divergence
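
For reference, a minimal sketch of the DQN loss on one minibatch (PyTorch, assuming batched tensors and the networks above; actions is a LongTensor, dones a float tensor of 0/1 flags). Note the (1 - done) factor, which reduces the target to the immediate reward when the next state is terminal:

    import torch
    import torch.nn.functional as F

    def dqn_loss(main_net, target_net, states, actions, rewards, next_states, dones, gamma=0.99):
        # predicted Q(s, a) for the actions actually taken
        q_pred = main_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)
        with torch.no_grad():
            # bootstrapped target computed from the frozen target network
            q_next = target_net(next_states).max(dim=1).values
            q_target = rewards + gamma * q_next * (1.0 - dones)
        return F.mse_loss(q_pred, q_target)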

22 What main issue does 'Double DQN' address?

A. Slow convergence speed
B. Overestimation of Q-values
C. Underestimation of Q-values
D. High memory usage

23 In standard DQN, how is the bootstrap term of the target value calculated (ignoring the reward and the discount factor gamma)?

A. max_a Q(s', a; target_weights)
B. Q(s', argmax_a Q(s', a; main_weights); target_weights)
C. Average of Q-values
D. Minimum of Q-values

24 How does Double DQN calculate the target Q-value?

A. It uses two totally independent networks trained on different data.
B. It uses the Main network to select the best action and the Target network to evaluate its value.
C. It doubles the reward.
D. It uses the Target network to select the action and the Main network to evaluate it.
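
For reference, a minimal sketch of the Double DQN target only (same tensor assumptions as the loss sketch above): the main network selects the argmax action, the target network evaluates it:

    import torch

    def double_dqn_target(main_net, target_net, rewards, next_states, dones, gamma=0.99):
        with torch.no_grad():
            # 1) action selection with the MAIN network
            best_actions = main_net(next_states).argmax(dim=1, keepdim=True)
            # 2) action evaluation with the TARGET network
            q_eval = target_net(next_states).gather(1, best_actions).squeeze(1)
            return rewards + gamma * q_eval * (1.0 - dones)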

25 What is the architectural change in Dueling DQN compared to standard DQN?

A. It uses two separate neural networks for two different agents.
B. It splits the network into two streams: Value stream and Advantage stream.
C. It removes the convolutional layers.
D. It uses Recurrent Neural Networks.

26 In Dueling DQN, what does the Value function V(s) estimate?

A. How good it is to be in a particular state, regardless of the action taken.
B. How good a specific action is compared to others.
C. The immediate reward.
D. The total error.

27 In Dueling DQN, what does the Advantage function A(s, a) estimate?

A. The value of the state.
B. The importance of the state.
C. How much better taking action 'a' is compared to the average action in state 's'.
D. The probability of winning.

28 How are the Value and Advantage streams combined in Dueling DQN to get Q-values?

A. Multiplication
B. Concatenation
C. Aggregation (Summation with normalization)
D. Convolution
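
For reference, a minimal sketch of the Dueling head, assuming some shared feature extractor that produces feature_dim features per state; the mean of the advantages is subtracted in the aggregation (see question 41):

    import torch.nn as nn

    class DuelingHead(nn.Module):
        def __init__(self, feature_dim, n_actions):
            super().__init__()
            self.value = nn.Linear(feature_dim, 1)              # V(s): one scalar per state
            self.advantage = nn.Linear(feature_dim, n_actions)  # A(s, a): one value per action

        def forward(self, features):
            v = self.value(features)                  # (batch, 1)
            a = self.advantage(features)              # (batch, n_actions)
            # aggregation: Q(s, a) = V(s) + (A(s, a) - mean_a A(s, a))
            return v + a - a.mean(dim=1, keepdim=True)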

29 What is the main benefit of Dueling DQN?

A. It allows the agent to learn which states are valuable without having to learn the effect of every action.
B. It eliminates the need for Experience Replay.
C. It works without rewards.
D. It is computationally cheaper than standard DQN.

30 Which of the following is true about 'Off-policy' learning in Q-learning?

A. The agent learns the value of the policy it is currently executing.
B. The agent learns the value of the optimal policy regardless of the current actions taken.
C. It requires a model of the environment.
D. It cannot use Experience Replay.

31 What does the 'State' represent in a Reinforcement Learning framework?

A. The feedback from the environment.
B. The decision maker.
C. The current situation or configuration of the environment.
D. The move made by the agent.

32 In the Bellman optimality equation, what does 'max_a Q(s', a)' represent?

A. The worst possible future outcome.
B. The value of the best action available in the next state.
C. The average value of the next state.
D. The immediate reward.

33 If the next state s' is terminal, what is the target Q-value for that transition?

A. The immediate reward (r) only.
B. Reward + gamma * max Q.
C. Zero.
D. Infinity.

34 Which of the following creates a 'Moving Target' problem in naive Deep Q-Learning?

A. Using a fixed target network.
B. Using the same network to calculate both predicted value and target value.
C. Using a replay buffer.
D. Using a small learning rate.

35 What is 'Catastrophic Forgetting' in the context of RL?

A. The agent forgets the goal.
B. The agent forgets previously learned knowledge when training on new dissimilar experiences.
C. The replay buffer gets deleted.
D. The gradients vanish.

36 In Q-Learning, convergence to the optimal Q-values is guaranteed if:

A. The neural network is deep enough.
B. All state-action pairs are visited infinitely often and learning rate decays appropriately.
C. Epsilon is kept at 1.0.
D. The discount factor is 1.

37 Which component of the tuple (S, A, R, S') is NOT known before the agent takes an action?

A. S (Current State)
B. A (Action chosen)
C. R and S' (Reward and Next State)
D. None of the above

38 In Double DQN, the update equation replaces the target 'Y' with:

A. R + gamma * Q_target(s', argmax_a Q_main(s', a))
B. R + gamma * max Q_target(s', a)
C. R + gamma * Q_main(s', argmax_a Q_target(s', a))
D. R + gamma * V(s')

39 What is the primary motivation for 'Prioritized Experience Replay'?

A. To replay recent experiences first.
B. To replay experiences where the agent had a high TD error (learned the most).
C. To save memory.
D. To ensure random sampling.
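
For reference, a simplified prioritized-sampling sketch: transitions are drawn with probability proportional to |TD error|^alpha. The alpha value and the omission of importance-sampling weights are simplifying assumptions:

    import numpy as np

    def prioritized_sample(td_errors, batch_size, alpha=0.6, eps=1e-5):
        # larger TD error -> higher priority -> sampled more often
        priorities = (np.abs(td_errors) + eps) ** alpha
        probs = priorities / priorities.sum()
        return np.random.choice(len(td_errors), size=batch_size, p=probs)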

40 When preprocessing images for DQN (e.g., Atari), what is a common technique?

A. Increasing resolution to 4K.
B. Converting to grayscale and resizing.
C. Adding noise.
D. Inverting colors.
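
For reference, a minimal preprocessing sketch using OpenCV, assuming an RGB frame supplied as a NumPy array; grayscale conversion and resizing to 84x84 follow the common Atari setup:

    import cv2
    import numpy as np

    def preprocess(frame):
        gray = cv2.cvtColor(frame, cv2.COLOR_RGB2GRAY)               # drop the colour channels
        small = cv2.resize(gray, (84, 84), interpolation=cv2.INTER_AREA)
        return small.astype(np.float32) / 255.0                      # scale pixels to [0, 1]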

41 In Dueling DQN, the aggregation layer usually subtracts the mean of the Advantage values. Why?

A. To reduce the size of the output.
B. For numerical stability and identifiability.
C. To make the values positive.
D. It is a requirement of the activation function.

42 What is an 'Episode' in Reinforcement Learning?

A. One step of training.
B. A sequence of states, actions, and rewards from start to a terminal state.
C. The entire training process.
D. A single update of the Q-table.

43 Which activation function is commonly used in the hidden layers of a Deep Q-Network?

A. Sigmoid
B. ReLU (Rectified Linear Unit)
C. Softmax
D. Step function

44 Why is the Softmax function generally NOT used in the output layer of a DQN?

A. It is too slow.
B. DQN outputs Q-values (regression), not probabilities (classification).
C. It cannot handle negative numbers.
D. It is not differentiable.

45 In the context of RL, what is 'Exploitation'?

A. Trying new actions to gather information.
B. Selecting the action currently believed to be optimal.
C. Stopping the training early.
D. Increasing the discount factor.

46 What is 'Frame Stacking' in DQN for Atari games?

A. Stacking Q-tables on top of each other.
B. Stacking consecutive frames to capture motion/velocity.
C. Stacking multiple neural networks.
D. Stacking rewards.
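
For reference, a minimal frame-stacking sketch, assuming preprocessed 84x84 frames like those above; stacking the last 4 frames lets the network infer motion and velocity:

    import numpy as np
    from collections import deque

    frames = deque(maxlen=4)              # keep only the most recent 4 frames

    def stack_frames(frames, new_frame):
        frames.append(new_frame)
        while len(frames) < 4:            # at episode start, pad by repeating the first frame
            frames.append(new_frame)
        return np.stack(list(frames), axis=0)   # shape (4, 84, 84), the DQN input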

47 What optimization algorithm is typically used to train the DQN weights?

A. K-Means Clustering
B. Gradient Descent (e.g., RMSProp or Adam)
C. Genetic Algorithms
D. Principal Component Analysis

48 If the Q-values for all actions in a state are equal, what will an epsilon-greedy policy (with epsilon=0) do?

A. Choose an action randomly among them (or the first one).
B. Stop the episode.
C. Choose no action.
D. Increase epsilon.

49 Which of the following implies that an RL problem is 'episodic'?

A. The agent runs forever.
B. The task breaks down into independent sequences ending in a terminal state.
C. The discount factor is 1.
D. The environment is deterministic.

50 What is the primary reason DQN was considered a breakthrough (published by DeepMind)?

A. It solved the traveling salesman problem.
B. It was the first algorithm to master a wide range of Atari 2600 games using only raw pixels and scores.
C. It proved that gamma should always be 0.99.
D. It used a new type of CPU.