Unit 5 - Practice Quiz

INT423 50 Questions

1 What type of Reinforcement Learning algorithm is Q-Learning?

A. Model-based, On-policy
B. Model-free, On-policy
C. Model-free, Off-policy
D. Model-based, Off-policy

2 In Q-Learning, what does the 'Q' specifically represent?

A. Quality
B. Query
C. Queue
D. Quantity

3 What is the primary data structure used in basic tabular Q-Learning?

A. A Q-Table
B. A Decision Tree
C. A Neural Network
D. A Graph

4 Which equation is used to update the Q-values in Q-Learning?

A. Schrödinger Equation
B. Bellman Equation
C. Maxwell's Equations
D. Euler's Equation

5 In the Q-learning update rule, what is the role of the learning rate (alpha)?

A. It determines the probability of exploring a random action.
B. It calculates the total cumulative reward.
C. It determines the importance of future rewards.
D. It controls how much the new information overrides the old information.
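
As a quick illustration of the update rule referenced above, here is a minimal tabular sketch (the table size, reward, and hyperparameter values are arbitrary, chosen only to show the roles of alpha, gamma, and the TD error):

```python
import numpy as np

def q_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.9):
    """One Q-learning update (derived from the Bellman equation).

    td_target = r + gamma * max_a' Q(s', a')
    td_error  = td_target - Q(s, a)   # the Temporal Difference error
    alpha controls how much the new information overrides the old;
    gamma balances immediate vs. future rewards (gamma=0 -> purely myopic).
    """
    td_target = r + gamma * np.max(Q[s_next])
    td_error = td_target - Q[s, a]
    Q[s, a] += alpha * td_error
    return td_error

# Tiny Q-table: 2 states x 2 actions, initialized to zero.
Q = np.zeros((2, 2))
err = q_update(Q, s=0, a=1, r=1.0, s_next=1)
# Q[0, 1] becomes alpha * (1.0 + 0.9 * 0 - 0) = 0.1
```

Setting `gamma=0` in the same function makes the target collapse to the immediate reward, matching the question about a zero discount factor.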

6 What is the purpose of the discount factor (gamma) in Q-Learning?

A. To determine the exploration rate
B. To set the learning speed
C. To initialize the Q-table
D. To balance immediate and future rewards

7 If the discount factor (gamma) is set to 0, what will the agent optimize for?

A. Random rewards
B. Long-term cumulative reward
C. Only the immediate reward
D. The average reward over time

8 What is the 'Temporal Difference (TD) Error' in the context of Q-Learning?

A. The difference between the target Q-value and the current predicted Q-value
B. The time it takes to converge
C. The error in the reward function
D. The difference between the current Q-value and the previous Q-value

9 What is the Epsilon-Greedy strategy used for?

A. To calculate the loss function
B. To store experiences in replay memory
C. To balance exploration and exploitation
D. To update the weights of the network

10 In an Epsilon-Greedy strategy, what happens if epsilon is 1?

A. The agent stops learning.
B. The agent always chooses a random action.
C. The agent alternates between best and random actions.
D. The agent always chooses the action with the highest Q-value.

11 What is the typical behavior of epsilon during the training process in Deep Q-Learning?

A. It starts high and decays over time.
B. It fluctuates randomly.
C. It remains constant throughout training.
D. It starts low and increases over time.
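
The epsilon-greedy behavior described in the last three questions can be sketched in a few lines (the decay constants here are illustrative, not prescribed values):

```python
import random

def epsilon_greedy(q_values, epsilon, rng=random):
    """With probability epsilon, explore (random action); otherwise exploit
    (argmax of Q-values). epsilon=1 -> always random; epsilon=0 -> always greedy."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])

def decayed_epsilon(step, eps_start=1.0, eps_end=0.05, decay=0.995):
    """Typical schedule: start high (explore a lot), decay toward a small floor."""
    return max(eps_end, eps_start * decay ** step)
```

With `epsilon=0` the policy is purely greedy; early in training `decayed_epsilon` stays near 1 and the agent acts mostly at random.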

12 Why does tabular Q-Learning fail in environments like Atari games or robotics?

A. The math does not apply to games.
B. Q-learning cannot handle discrete actions.
C. The rewards are not defined.
D. The state space is too large (Curse of Dimensionality).

13 What replaces the Q-Table in a Deep Q-Network (DQN)?

A. A larger Q-Table
B. A Genetic Algorithm
C. A Linear Regression model
D. A Deep Neural Network

14 What is the input to the neural network in a standard DQN for playing video games?

A. The Q-value
B. The current score
C. The raw pixels of the game screen (state)
D. The action to be taken

15 What is the output layer size of a DQN used for an environment with 'N' discrete actions?

A. 1 (The best action)
B. N x N
C. N (One Q-value for each action)
D. 1 (The value of the state)

16 What is 'Experience Replay' in DQN?

A. Running the same episode multiple times.
B. Using the target network to replay actions.
C. Replaying the game after winning.
D. Storing past transitions (s, a, r, s') in a buffer and sampling minibatches for training.

17 What is the primary benefit of using Experience Replay?

A. It guarantees finding the global minimum.
B. It breaks the correlation between consecutive samples and stabilizes training.
C. It increases the epsilon value.
D. It removes the need for a target network.
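
A minimal replay-buffer sketch matching the description above (capacity and field names are arbitrary choices for illustration):

```python
import random
from collections import deque

class ReplayBuffer:
    """Stores (s, a, r, s', done) transitions; sampling random minibatches
    breaks the correlation between consecutive samples and stabilizes training."""
    def __init__(self, capacity=10000):
        self.buffer = deque(maxlen=capacity)  # oldest transitions are evicted

    def push(self, s, a, r, s_next, done):
        self.buffer.append((s, a, r, s_next, done))

    def sample(self, batch_size):
        return random.sample(self.buffer, batch_size)

    def __len__(self):
        return len(self.buffer)

buf = ReplayBuffer(capacity=100)
for t in range(5):
    buf.push(s=t, a=0, r=1.0, s_next=t + 1, done=False)
batch = buf.sample(3)  # random minibatch, not consecutive steps
```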

18 In the context of DQN, what is the 'Target Network'?

A. The network that selects the action.
B. The network that predicts the reward.
C. A copy of the main network with frozen weights used to calculate target Q-values.
D. The network used during the testing phase only.

19 Why is a Target Network necessary in DQN?

A. To handle continuous action spaces.
B. To speed up the backpropagation process.
C. To prevent the 'chasing your own tail' instability where target values shift constantly.
D. To increase the exploration rate.

20 How are weights usually updated in the Target Network?

A. Updated using a different loss function.
B. Randomly initialized every step.
C. Continuous backpropagation along with the main network.
D. Copied from the main network every fixed number of steps.
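
The "hard update" scheme from the question above can be mimicked with plain arrays standing in for network parameters (the sync interval and step size are made up for the demo):

```python
import numpy as np

# Arrays stand in for the parameters of the main and target networks.
main_weights = np.array([0.5, -1.2, 3.0])
target_weights = main_weights.copy()  # target starts as a frozen copy

SYNC_EVERY = 1000  # hypothetical interval

for step in range(1, 2501):
    main_weights = main_weights + 0.001  # stand-in for a gradient step
    if step % SYNC_EVERY == 0:
        target_weights = main_weights.copy()  # copy every fixed number of steps

# Last sync happened at step 2000, so the target lags the main network,
# keeping the targets stable between syncs.
```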

21 What is the loss function typically used in DQN?

A. Hinge Loss
B. Cross-Entropy Loss
C. Kullback-Leibler Divergence
D. Mean Squared Error (MSE) between predicted Q and target Q

22 What main issue does 'Double DQN' address?

A. Overestimation of Q-values
B. Underestimation of Q-values
C. Slow convergence speed
D. High memory usage

23 In standard DQN, how is the target value calculated (ignoring reward and gamma)?

A. Average of Q-values
B. Minimum of Q-values
C. Q(s', argmax_a Q(s', a; main_weights); target_weights)
D. max_a Q(s', a; target_weights)

24 How does Double DQN calculate the target Q-value?

A. It uses the Main network to select the best action and the Target network to evaluate its value.
B. It uses the Target network to select the action and the Main network to evaluate it.
C. It uses two totally independent networks trained on different data.
D. It doubles the reward.
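
The contrast between the standard and Double DQN targets (including the terminal-state case asked about later) can be written out directly; the example Q-values are contrived to show how decoupling selection from evaluation damps overestimation:

```python
import numpy as np

def dqn_target(r, gamma, q_target_next, done):
    """Standard DQN: the target network both selects and evaluates the action
    (max over its own Q-values), which tends to overestimate."""
    if done:
        return r  # terminal state: target is the immediate reward only
    return r + gamma * np.max(q_target_next)

def double_dqn_target(r, gamma, q_main_next, q_target_next, done):
    """Double DQN: the MAIN network selects the action (argmax),
    the TARGET network evaluates it."""
    if done:
        return r
    a_star = int(np.argmax(q_main_next))
    return r + gamma * q_target_next[a_star]

q_main_next = np.array([1.0, 2.0])    # main net prefers action 1
q_target_next = np.array([5.0, 0.5])  # target net's (noisy) estimates
y_dqn = dqn_target(1.0, 0.9, q_target_next, done=False)        # uses max -> 5.5
y_double = double_dqn_target(1.0, 0.9, q_main_next,
                             q_target_next, done=False)        # uses a*=1 -> 1.45
```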

25 What is the architectural change in Dueling DQN compared to standard DQN?

A. It uses two separate neural networks for two different agents.
B. It removes the convolutional layers.
C. It uses Recurrent Neural Networks.
D. It splits the network into two streams: Value stream and Advantage stream.

26 In Dueling DQN, what does the Value function V(s) estimate?

A. The total error.
B. How good a specific action is compared to others.
C. How good it is to be in a particular state, regardless of the action taken.
D. The immediate reward.

27 In Dueling DQN, what does the Advantage function A(s, a) estimate?

A. The importance of the state.
B. The value of the state.
C. How much better taking action 'a' is compared to the average action in state 's'.
D. The probability of winning.

28 How are the Value and Advantage streams combined in Dueling DQN to get Q-values?

A. Aggregation (Summation with normalization)
B. Concatenation
C. Convolution
D. Multiplication

29 What is the main benefit of Dueling DQN?

A. It works without rewards.
B. It eliminates the need for Experience Replay.
C. It is computationally cheaper than standard DQN.
D. It allows the agent to learn which states are valuable without having to learn the effect of every action.
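
The aggregation step asked about above is just a one-liner once the two streams exist; the values below are invented to make the arithmetic easy to follow:

```python
import numpy as np

def dueling_aggregate(v, advantages):
    """Combine the Value and Advantage streams:

        Q(s, a) = V(s) + (A(s, a) - mean_a A(s, a))

    Subtracting the mean makes the decomposition identifiable: without it,
    a constant could shift freely between V and A."""
    a = np.asarray(advantages, dtype=float)
    return v + (a - a.mean())

q = dueling_aggregate(v=2.0, advantages=[1.0, -1.0, 0.0])
# mean(A) = 0 here, so Q = [3.0, 1.0, 2.0] and mean(Q) equals V(s)
```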

30 Which of the following is true about 'Off-policy' learning in Q-learning?

A. It cannot use Experience Replay.
B. It requires a model of the environment.
C. The agent learns the value of the optimal policy regardless of the current actions taken.
D. The agent learns the value of the policy it is currently executing.

31 What represents the 'State' in a Reinforcement Learning framework?

A. The current situation or configuration of the environment.
B. The move made by the agent.
C. The feedback from the environment.
D. The decision maker.

32 In the Bellman optimality equation, what does 'max_a Q(s', a)' represent?

A. The average value of the next state.
B. The value of the best action available in the next state.
C. The immediate reward.
D. The worst possible future outcome.

33 If an agent reaches a terminal state, what is the Target Q-value?

A. The immediate reward (r) only.
B. Zero.
C. Reward + gamma * max Q.
D. Infinity.

34 Which of the following creates a 'Moving Target' problem in naive Deep Q-Learning?

A. Using the same network to calculate both predicted value and target value.
B. Using a small learning rate.
C. Using a fixed target network.
D. Using a replay buffer.

35 What is 'Catastrophic Forgetting' in the context of RL?

A. The agent forgets previously learned knowledge when training on new dissimilar experiences.
B. The gradients vanish.
C. The agent forgets the goal.
D. The replay buffer gets deleted.

36 In Q-Learning, convergence to the optimal Q-values is guaranteed if:

A. The neural network is deep enough.
B. All state-action pairs are visited infinitely often and learning rate decays appropriately.
C. Epsilon is kept at 1.0.
D. The discount factor is 1.

37 Which component of the tuple (S, A, R, S') is NOT known before the agent takes an action?

A. R and S' (Reward and Next State)
B. None of the above
C. A (Action chosen)
D. S (Current State)

38 In Double DQN, the update equation replaces the target 'Y' with:

A. R + gamma * Q_target(s', argmax Q_main(s', a))
B. R + gamma * V(s')
C. R + gamma * max Q_target(s', a)
D. R + gamma * Q_main(s', argmax Q_target(s', a))

39 What is the primary motivation for 'Prioritized Experience Replay'?

A. To replay recent experiences first.
B. To ensure random sampling.
C. To save memory.
D. To replay experiences where the agent had a high TD error (learned the most).

40 When preprocessing images for DQN (e.g., Atari), what is a common technique?

A. Inverting colors.
B. Increasing resolution to 4K.
C. Adding noise.
D. Converting to grayscale and resizing.
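
A rough sketch of the grayscale-and-resize step (the stride-based downsample is a crude stand-in for a proper resize to, e.g., 84x84; the frame dimensions match the Atari screen):

```python
import numpy as np

def preprocess(frame):
    """Convert an RGB frame to grayscale using standard luminance weights,
    then downsample by striding every other pixel."""
    gray = frame @ np.array([0.299, 0.587, 0.114])  # (H, W, 3) -> (H, W)
    return gray[::2, ::2]  # crude 2x downsample in each dimension

frame = np.ones((210, 160, 3))  # dummy 210x160 RGB frame
small = preprocess(frame)       # shape (105, 80), single channel
```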

41 In Dueling DQN, the aggregation layer usually subtracts the mean of the Advantage values. Why?

A. To reduce the size of the output.
B. It is a requirement of the activation function.
C. To make the values positive.
D. For numerical stability and identifiability.

42 What is an 'Episode' in Reinforcement Learning?

A. One step of training.
B. The entire training process.
C. A sequence of states, actions, and rewards from start to a terminal state.
D. A single update of the Q-table.

43 Which activation function is commonly used in the hidden layers of a Deep Q-Network?

A. Sigmoid
B. Step function
C. ReLU (Rectified Linear Unit)
D. Softmax

44 Why is the Softmax function generally NOT used in the output layer of a DQN?

A. It cannot handle negative numbers.
B. It is not differentiable.
C. DQN outputs Q-values (regression), not probabilities (classification).
D. It is too slow.

45 In the context of RL, what is 'Exploitation'?

A. Increasing the discount factor.
B. Selecting the action currently believed to be optimal.
C. Stopping the training early.
D. Trying new actions to gather information.

46 What is 'Frame Stacking' in DQN for Atari games?

A. Stacking Q-tables on top of each other.
B. Stacking rewards.
C. Stacking multiple neural networks.
D. Stacking consecutive frames to capture motion/velocity.
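
Frame stacking falls out naturally from a bounded deque; the stack depth of 4 and the 84x84 frame size are the conventional Atari choices, used here for illustration:

```python
import numpy as np
from collections import deque

STACK = 4  # number of consecutive frames per state

frames = deque(maxlen=STACK)
first = np.zeros((84, 84))
for _ in range(STACK):      # at episode start, repeat the first frame
    frames.append(first)

def push_frame(frame):
    """Append the newest frame; deque(maxlen=4) drops the oldest, so the
    stacked state always holds the 4 most recent frames, letting the
    network infer motion and velocity from a single input."""
    frames.append(frame)
    return np.stack(frames, axis=0)  # shape (4, 84, 84) network input

state = push_frame(np.ones((84, 84)))
```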

47 What optimization algorithm is typically used to train the DQN weights?

A. Genetic Algorithms
B. Gradient Descent (e.g., RMSProp or Adam)
C. K-Means Clustering
D. Principal Component Analysis

48 If the Q-values for all actions in a state are equal, what will an epsilon-greedy policy (with epsilon=0) do?

A. Increase epsilon.
B. Choose an action randomly among them (or the first one).
C. Choose no action.
D. Stop the episode.

49 Which of the following implies that an RL problem is 'episodic'?

A. The environment is deterministic.
B. The agent runs forever.
C. The discount factor is 1.
D. The task breaks down into independent sequences ending in a terminal state.

50 What is the primary reason DQN was considered a breakthrough (published by DeepMind)?

A. It was the first algorithm to master a wide range of Atari 2600 games using only raw pixels and scores.
B. It solved the traveling salesman problem.
C. It used a new type of CPU.
D. It proved that gamma should always be 0.99.