Unit 5 - Practice Quiz

INT423 50 Questions

1 What type of Reinforcement Learning algorithm is Q-Learning?

A. Model-free, On-policy
B. Model-free, Off-policy
C. Model-based, On-policy
D. Model-based, Off-policy

2 In Q-Learning, what does the 'Q' specifically represent?

A. Quality
B. Quantity
C. Query
D. Queue

3 What is the primary data structure used in basic tabular Q-Learning?

A. A Q-Table
B. A Neural Network
C. A Graph
D. A Decision Tree

4 Which equation is used to update the Q-values in Q-Learning?

A. Euler's Equation
B. Bellman Equation
C. Maxwell's Equation
D. Schrödinger Equation

5 In the Q-learning update rule, what is the role of the learning rate (alpha)?

A. It determines the importance of future rewards.
B. It controls how much the new information overrides the old information.
C. It determines the probability of exploring a random action.
D. It calculates the total cumulative reward.

6 What is the purpose of the discount factor (gamma) in Q-Learning?

A. To balance immediate and future rewards
B. To set the learning speed
C. To initialize the Q-table
D. To determine the exploration rate

7 If the discount factor (gamma) is set to 0, what will the agent optimize for?

A. Long-term cumulative reward
B. Random rewards
C. Only the immediate reward
D. The average reward over time

8 What is the 'Temporal Difference (TD) Error' in the context of Q-Learning?

A. The difference between the target Q-value and the current predicted Q-value
B. The time it takes to converge
C. The error in the reward function
D. The difference between the current Q-value and the previous Q-value

9 What is the Epsilon-Greedy strategy used for?

A. To calculate the loss function
B. To balance exploration and exploitation
C. To update the weights of the network
D. To store experiences in replay memory

10 In an Epsilon-Greedy strategy, what happens if epsilon is 1?

A. The agent alternates between best and random actions.
B. The agent stops learning.
C. The agent always chooses a random action.
D. The agent always chooses the action with the highest Q-value.

11 What is the typical behavior of epsilon during the training process in Deep Q-Learning?

A. It starts low and increases over time.
B. It fluctuates randomly.
C. It remains constant throughout training.
D. It starts high and decays over time.

12 Why does tabular Q-Learning fail in environments like Atari games or robotics?

A. The rewards are not defined.
B. The math does not apply to games.
C. The state space is too large (Curse of Dimensionality).
D. Q-learning cannot handle discrete actions.

13 What replaces the Q-Table in a Deep Q-Network (DQN)?

A. A Deep Neural Network
B. A Genetic Algorithm
C. A Linear Regression model
D. A larger Q-Table

14 What is the input to the neural network in a standard DQN for playing video games?

A. The raw pixels of the game screen (state)
B. The Q-value
C. The action to be taken
D. The current score

15 What is the output layer size of a DQN used for an environment with 'N' discrete actions?

A. 1 (The value of the state)
B. 1 (The best action)
C. N x N
D. N (One Q-value for each action)

16 What is 'Experience Replay' in DQN?

A. Storing past transitions (s, a, r, s') in a buffer and sampling minibatches for training.
B. Running the same episode multiple times.
C. Replaying the game after winning.
D. Using the target network to replay actions.

17 What is the primary benefit of using Experience Replay?

A. It removes the need for a target network.
B. It guarantees finding the global minimum.
C. It increases the epsilon value.
D. It breaks the correlation between consecutive samples and stabilizes training.
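
The buffer described in questions 16 and 17 can be sketched in a few lines. This is an illustrative sketch only: the class name `ReplayBuffer`, its method names, and the tuple layout `(state, action, reward, next_state, done)` are assumptions, not an API from any particular library.

```python
import random
from collections import deque

class ReplayBuffer:
    """Fixed-size store of transitions with uniform random sampling."""

    def __init__(self, capacity):
        # A deque with maxlen silently drops the oldest item when full.
        self.buffer = deque(maxlen=capacity)

    def push(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size, rng=random):
        # Sampling without replacement from the whole buffer mixes old and
        # new transitions, so a minibatch is not a run of consecutive steps.
        return rng.sample(list(self.buffer), batch_size)

    def __len__(self):
        return len(self.buffer)

# Hypothetical usage: five transitions pushed into a buffer of capacity three.
buffer = ReplayBuffer(capacity=3)
for step in range(5):
    buffer.push(state=step, action=0, reward=1.0, next_state=step + 1, done=False)

batch = buffer.sample(batch_size=2)
```

After the loop only the three most recent transitions (states 2, 3, and 4) remain, and `batch` holds two of them chosen at random.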

18 In the context of DQN, what is the 'Target Network'?

A. The network used during the testing phase only.
B. The network that selects the action.
C. A copy of the main network with frozen weights used to calculate target Q-values.
D. The network that predicts the reward.

19 Why is a Target Network necessary in DQN?

A. To increase the exploration rate.
B. To speed up the backpropagation process.
C. To prevent the 'chasing your own tail' instability where target values shift constantly.
D. To handle continuous action spaces.

20 How are weights usually updated in the Target Network?

A. Updated using a different loss function.
B. Copied from the main network every fixed number of steps.
C. Updated by continuous backpropagation, together with the main network.
D. Randomly initialized every step.

21 What is the loss function typically used in DQN?

A. Cross-Entropy Loss
B. Hinge Loss
C. Mean Squared Error (MSE) between predicted Q and target Q
D. Kullback-Leibler Divergence

22 What main issue does 'Double DQN' address?

A. Underestimation of Q-values
B. Overestimation of Q-values
C. High memory usage
D. Slow convergence speed

23 In standard DQN, how is the target value calculated (ignoring reward and gamma)?

A. max_a Q(s', a; target_weights)
B. Average of Q-values
C. Minimum of Q-values
D. Q(s', argmax_a Q(s', a; main_weights); target_weights)

24 How does Double DQN calculate the target Q-value?

A. It uses the Target network to select the action and the Main network to evaluate it.
B. It uses the Main network to select the best action and the Target network to evaluate its value.
C. It uses two totally independent networks trained on different data.
D. It doubles the reward.

25 What is the architectural change in Dueling DQN compared to standard DQN?

A. It removes the convolutional layers.
B. It uses two separate neural networks for two different agents.
C. It splits the network into two streams: Value stream and Advantage stream.
D. It uses Recurrent Neural Networks.

26 In Dueling DQN, what does the Value function V(s) estimate?

A. The immediate reward.
B. How good it is to be in a particular state, regardless of the action taken.
C. How good a specific action is compared to others.
D. The total error.

27 In Dueling DQN, what does the Advantage function A(s, a) estimate?

A. The value of the state.
B. The probability of winning.
C. The importance of the state.
D. How much better taking action 'a' is compared to the average action in state 's'.

28 How are the Value and Advantage streams combined in Dueling DQN to get Q-values?

A. Multiplication
B. Concatenation
C. Aggregation (summation, with the advantages normalized)
D. Convolution

29 What is the main benefit of Dueling DQN?

A. It eliminates the need for Experience Replay.
B. It allows the agent to learn which states are valuable without having to learn the effect of every action.
C. It works without rewards.
D. It is computationally cheaper than standard DQN.

30 Which of the following is true about 'Off-policy' learning in Q-learning?

A. The agent learns the value of the policy it is currently executing.
B. It requires a model of the environment.
C. It cannot use Experience Replay.
D. The agent learns the value of the optimal policy regardless of the current actions taken.

31 What represents the 'State' in a Reinforcement Learning framework?

A. The current situation or configuration of the environment.
B. The move made by the agent.
C. The feedback from the environment.
D. The decision maker.

32 In the Bellman optimality equation, what does 'max_a Q(s', a)' represent?

A. The average value of the next state.
B. The immediate reward.
C. The worst possible future outcome.
D. The value of the best action available in the next state.

33 If an agent reaches a terminal state, what is the Target Q-value?

A. The immediate reward (r) only.
B. Reward + gamma * max Q.
C. Zero.
D. Infinity.

34 Which of the following creates a 'Moving Target' problem in naive Deep Q-Learning?

A. Using a fixed target network.
B. Using a small learning rate.
C. Using a replay buffer.
D. Using the same network to calculate both predicted value and target value.

35 What is 'Catastrophic Forgetting' in the context of RL?

A. The gradients vanish.
B. The agent forgets previously learned knowledge when training on new, dissimilar experiences.
C. The replay buffer gets deleted.
D. The agent forgets the goal.

36 In Q-Learning, convergence to the optimal Q-values is guaranteed if:

A. Epsilon is kept at 1.0.
B. The neural network is deep enough.
C. All state-action pairs are visited infinitely often and learning rate decays appropriately.
D. The discount factor is 1.

37 Which components of the tuple (S, A, R, S') are NOT known before the agent takes an action?

A. S (Current State)
B. A (Action chosen)
C. None of the above
D. R and S' (Reward and Next State)

38 In Double DQN, the update equation replaces the target 'Y' with:

A. R + gamma * Q_main(s', argmax Q_target(s', a))
B. R + gamma * V(s')
C. R + gamma * max Q_target(s', a)
D. R + gamma * Q_target(s', argmax Q_main(s', a))
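
The target formulas that appear in questions 23, 24, 33, and 38 differ only in which network selects the next action and which one evaluates it. The sketch below writes out both computations using plain Python lists in place of network outputs; the function names `dqn_target` and `double_dqn_target` and the toy Q-values are assumptions made for illustration.

```python
def dqn_target(reward, gamma, q_target_next, done=False):
    """Standard DQN: the target network both selects and evaluates the action."""
    if done:
        return reward  # no bootstrapping from a terminal state
    return reward + gamma * max(q_target_next)

def double_dqn_target(reward, gamma, q_main_next, q_target_next, done=False):
    """Double DQN: the main network selects, the target network evaluates."""
    if done:
        return reward
    best_action = max(range(len(q_main_next)), key=lambda a: q_main_next[a])
    return reward + gamma * q_target_next[best_action]

# Hypothetical Q-values for the next state s' under each network.
q_main_next = [1.0, 3.0, 2.0]    # main network prefers action 1
q_target_next = [4.0, 2.0, 0.0]  # target network's largest value is for action 0

y_dqn = dqn_target(1.0, 0.5, q_target_next)                       # 1.0 + 0.5 * 4.0 = 3.0
y_ddqn = double_dqn_target(1.0, 0.5, q_main_next, q_target_next)  # 1.0 + 0.5 * 2.0 = 2.0
```

In this toy case the standard target takes the largest target-network value directly, while the Double DQN target uses the target network's value for the action the main network picked, which gives the smaller number.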

39 What is the primary motivation for 'Prioritized Experience Replay'?

A. To replay more often the experiences with a high TD error (those the agent can learn the most from).
B. To replay recent experiences first.
C. To save memory.
D. To ensure random sampling.

40 When preprocessing images for DQN (e.g., Atari), what is a common technique?

A. Adding noise.
B. Converting to grayscale and resizing.
C. Increasing resolution to 4K.
D. Inverting colors.

41 In Dueling DQN, the aggregation layer usually subtracts the mean of the Advantage values. Why?

A. To reduce the size of the output.
B. For numerical stability and identifiability.
C. It is a requirement of the activation function.
D. To make the values positive.
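
The mean-subtracting aggregation mentioned in question 41 can be written out directly. This is a minimal sketch that uses plain numbers in place of the two network streams; the function name `dueling_q_values` and the example values are assumptions.

```python
def dueling_q_values(value, advantages):
    """Combine V(s) and A(s, a) as Q(s, a) = V(s) + (A(s, a) - mean(A(s, .)))."""
    mean_adv = sum(advantages) / len(advantages)
    return [value + (a - mean_adv) for a in advantages]

# Hypothetical stream outputs for one state with three actions.
q_values = dueling_q_values(10.0, [1.0, 2.0, 3.0])      # [9.0, 10.0, 11.0]

# Adding the same constant to every advantage leaves the Q-values unchanged.
shifted = dueling_q_values(10.0, [101.0, 102.0, 103.0])  # [9.0, 10.0, 11.0]
```

Without the mean subtraction, a constant could move freely between the two streams without changing their sum, so V and A could not be recovered uniquely from Q. With it, the mean of the Q-values equals V(s).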

42 What is an 'Episode' in Reinforcement Learning?

A. A single update of the Q-table.
B. A sequence of states, actions, and rewards from start to a terminal state.
C. The entire training process.
D. One step of training.

43 Which activation function is commonly used in the hidden layers of a Deep Q-Network?

A. Sigmoid
B. Step function
C. ReLU (Rectified Linear Unit)
D. Softmax

44 Why is the Softmax function generally NOT used in the output layer of a DQN?

A. It cannot handle negative numbers.
B. DQN outputs Q-values (regression), not probabilities (classification).
C. It is not differentiable.
D. It is too slow.

45 In the context of RL, what is 'Exploitation'?

A. Increasing the discount factor.
B. Trying new actions to gather information.
C. Stopping the training early.
D. Selecting the action currently believed to be optimal.

46 What is 'Frame Stacking' in DQN for Atari games?

A. Stacking Q-tables on top of each other.
B. Stacking multiple neural networks.
C. Stacking rewards.
D. Stacking consecutive frames to capture motion/velocity.

47 What optimization algorithm is typically used to train the DQN weights?

A. K-Means Clustering
B. Genetic Algorithms
C. Gradient Descent (e.g., RMSProp or Adam)
D. Principal Component Analysis

48 If the Q-values for all actions in a state are equal, what will an epsilon-greedy policy (with epsilon=0) do?

A. Choose no action.
B. Increase epsilon.
C. Stop the episode.
D. Break the tie by choosing one of them (randomly, or the first one, depending on the implementation).

49 Which of the following implies that an RL problem is 'episodic'?

A. The task breaks down into independent sequences ending in a terminal state.
B. The discount factor is 1.
C. The environment is deterministic.
D. The agent runs forever.

50 What is the primary reason DQN, published by DeepMind, was considered a breakthrough?

A. It proved that gamma should always be 0.99.
B. It used a new type of CPU.
C. It was the first algorithm to master a wide range of Atari 2600 games using only raw pixels and scores.
D. It solved the traveling salesman problem.