1. What type of Reinforcement Learning algorithm is Q-Learning?
A. Model-based, On-policy
B. Model-free, On-policy
C. Model-free, Off-policy
D. Model-based, Off-policy
Correct Answer: Model-free, Off-policy
Explanation:
Q-Learning is model-free because it does not learn a model of the environment's dynamics, and off-policy because it learns the value of the optimal (greedy) policy independently of the behavior policy the agent actually follows.
2. In Q-Learning, what does the 'Q' specifically represent?
A. Quality
B. Query
C. Queue
D. Quantity
Correct Answer: Quality
Explanation:
The 'Q' stands for Quality, representing how useful a given action is in gaining some future reward.
3. What is the primary data structure used in basic tabular Q-Learning?
A. A Q-Table
B. A Decision Tree
C. A Neural Network
D. A Graph
Correct Answer: A Q-Table
Explanation:
Tabular Q-Learning uses a Q-Table (lookup table) to store the Q-values for every state-action pair.
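A Q-Table can be sketched as a 2-D array indexed by (state, action). The grid sizes below are illustrative, not tied to any particular environment:

```python
import numpy as np

n_states, n_actions = 16, 4                 # e.g. a 4x4 grid world with 4 moves
q_table = np.zeros((n_states, n_actions))   # one Q-value per (state, action) pair

# Look up the Q-values for a state and pick the greedy action
state = 5
greedy_action = int(np.argmax(q_table[state]))
```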
4. Which equation is used to update the Q-values in Q-Learning?
A. Schrödinger Equation
B. Bellman Equation
C. Maxwell's Equations
D. Euler's Equation
Correct Answer: Bellman Equation
Explanation:
The Bellman Equation provides the recursive relationship used to update Q-values based on the reward and the value of the next state.
5. In the Q-Learning update rule, what is the role of the learning rate (alpha)?
A. It determines the probability of exploring a random action.
B. It calculates the total cumulative reward.
C. It determines the importance of future rewards.
D. It controls how much the new information overrides the old information.
Correct Answer: It controls how much the new information overrides the old information.
Explanation:
The learning rate (alpha) determines the extent to which newly acquired information overrides old information. Alpha = 0 means nothing is learned; alpha = 1 means only the most recent information counts.
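The tabular update can be sketched as follows; the hyperparameter values and grid size are illustrative. Note how alpha blends the TD target into the old estimate:

```python
import numpy as np

alpha, gamma = 0.1, 0.99       # learning rate and discount factor (illustrative)
q_table = np.zeros((16, 4))

def q_update(s, a, r, s_next):
    """Bellman-based tabular update."""
    td_target = r + gamma * np.max(q_table[s_next])   # reward + best future value
    td_error = td_target - q_table[s, a]              # temporal-difference error
    q_table[s, a] += alpha * td_error                 # alpha=0: learn nothing; alpha=1: overwrite

q_update(s=0, a=1, r=1.0, s_next=2)
```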
6. What is the purpose of the discount factor (gamma) in Q-Learning?
A. To determine the exploration rate
B. To set the learning speed
C. To initialize the Q-table
D. To balance immediate and future rewards
Correct Answer: To balance immediate and future rewards
Explanation:
Gamma determines the present value of future rewards. A gamma of 0 makes the agent short-sighted (it only cares about the current reward), while a gamma close to 1 makes it far-sighted.
7. If the discount factor (gamma) is set to 0, what will the agent optimize for?
A. Random rewards
B. Long-term cumulative reward
C. Only the immediate reward
D. The average reward over time
Correct Answer: Only the immediate reward
Explanation:
With gamma = 0, future rewards are multiplied by 0, so the agent only considers the immediate reward (r) received from the current action.
8. What is the 'Temporal Difference (TD) Error' in the context of Q-Learning?
A. The difference between the target Q-value and the current predicted Q-value
B. The time it takes to converge
C. The error in the reward function
D. The difference between the current Q-value and the previous Q-value
Correct Answer: The difference between the target Q-value and the current predicted Q-value
Explanation:
The TD Error is the difference between the target value (reward + discounted max future Q of the next state) and the currently stored Q-value Q(s, a).
9. What is the Epsilon-Greedy strategy used for?
A. To calculate the loss function
B. To store experiences in replay memory
C. To balance exploration and exploitation
D. To update the weights of the network
Correct Answer: To balance exploration and exploitation
Explanation:
Epsilon-greedy is a policy that explores (chooses a random action) with probability epsilon and exploits (chooses the best known action) the rest of the time.
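A minimal epsilon-greedy action selector can be sketched like this (the example Q-values are made up):

```python
import random
import numpy as np

def epsilon_greedy(q_values, epsilon):
    """Explore with probability epsilon, otherwise exploit the best known action."""
    if random.random() < epsilon:
        return random.randrange(len(q_values))   # explore: uniform random action
    return int(np.argmax(q_values))              # exploit: highest Q-value

q = np.array([0.1, 0.5, 0.2])
action = epsilon_greedy(q, epsilon=0.1)          # usually 1, occasionally random
```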
10. In an Epsilon-Greedy strategy, what happens if epsilon is 1?
A. The agent stops learning.
B. The agent always chooses a random action.
C. The agent alternates between best and random actions.
D. The agent always chooses the action with the highest Q-value.
Correct Answer: The agent always chooses a random action.
Explanation:
If epsilon is 1, the probability of exploring (choosing a random action) is 100%.
11. What is the typical behavior of epsilon during the training process in Deep Q-Learning?
A. It starts high and decays over time.
B. It fluctuates randomly.
C. It remains constant throughout training.
D. It starts low and increases over time.
Correct Answer: It starts high and decays over time.
Explanation:
Epsilon usually starts high to encourage exploration of the environment early on and decays (decreases) over time so the agent increasingly exploits its learned policy as it improves.
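A common schedule is multiplicative decay with a floor; the start value, floor, and decay rate below are illustrative hyperparameters:

```python
eps_start, eps_end, decay = 1.0, 0.05, 0.995     # illustrative hyperparameters

epsilon = eps_start
schedule = []
for episode in range(1000):
    schedule.append(epsilon)
    epsilon = max(eps_end, epsilon * decay)      # decay each episode, never below the floor

# epsilon starts at 1.0 (pure exploration) and settles at the 0.05 floor
```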
12. Why does tabular Q-Learning fail in environments like Atari games or robotics?
A. The math does not apply to games.
B. Q-Learning cannot handle discrete actions.
C. The rewards are not defined.
D. The state space is too large (Curse of Dimensionality).
Correct Answer: The state space is too large (Curse of Dimensionality).
Explanation:
Tabular Q-Learning requires a row for every possible state. In environments with high-dimensional inputs (like image pixels), the number of states is astronomically large, making a table impossible to store or fill.
13. What replaces the Q-Table in a Deep Q-Network (DQN)?
A. A larger Q-Table
B. A Genetic Algorithm
C. A Linear Regression model
D. A Deep Neural Network
Correct Answer: A Deep Neural Network
Explanation:
In DQN, a neural network is used as a function approximator to estimate Q-values for states, rather than storing them in a table.
14. What is the input to the neural network in a standard DQN for playing video games?
A. The Q-value
B. The current score
C. The raw pixels of the game screen (state)
D. The action to be taken
Correct Answer: The raw pixels of the game screen (state)
Explanation:
The network takes the state (usually preprocessed screen pixels) as input and outputs Q-values for all possible actions.
15. What is the output layer size of a DQN used for an environment with 'N' discrete actions?
A. 1 (The best action)
B. N x N
C. N (One Q-value for each action)
D. 1 (The value of the state)
Correct Answer: N (One Q-value for each action)
Explanation:
A DQN takes a state as input and outputs a vector of size N, representing the predicted Q-value for each of the N possible actions.
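The state-in, N-Q-values-out shape can be sketched with a tiny fully-connected network in plain NumPy (real DQNs use a deep-learning framework and, for images, convolutional layers; all sizes here are illustrative):

```python
import numpy as np

state_dim, hidden, n_actions = 8, 32, 4          # illustrative sizes
rng = np.random.default_rng(0)
W1 = rng.normal(size=(state_dim, hidden)) * 0.1
W2 = rng.normal(size=(hidden, n_actions)) * 0.1

def q_network(state):
    """State in, one Q-value per action out (a regression head, no softmax)."""
    h = np.maximum(0.0, state @ W1)              # ReLU hidden layer
    return h @ W2                                # raw Q-values, shape (n_actions,)

q_values = q_network(np.ones(state_dim))         # one output per action
```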
16. What is 'Experience Replay' in DQN?
A. Running the same episode multiple times.
B. Using the target network to replay actions.
C. Replaying the game after winning.
D. Storing past transitions (s, a, r, s') in a buffer and sampling minibatches for training.
Correct Answer: Storing past transitions (s, a, r, s') in a buffer and sampling minibatches for training.
Explanation:
Experience Replay involves storing agent experiences in a buffer and randomly sampling them during training to break temporal correlations between consecutive samples.
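A minimal replay buffer sketch, using a bounded deque and uniform sampling (the capacity and dummy transitions are illustrative):

```python
import random
from collections import deque

buffer = deque(maxlen=10_000)                    # oldest transitions fall off the end

def store(s, a, r, s_next, done):
    buffer.append((s, a, r, s_next, done))

def sample(batch_size):
    """Uniform random minibatch breaks temporal correlation between samples."""
    return random.sample(buffer, batch_size)

for t in range(100):                             # fill with dummy transitions
    store(t, t % 4, 0.0, t + 1, False)
batch = sample(32)
```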
17. What is the primary benefit of using Experience Replay?
A. It guarantees finding the global minimum.
B. It breaks the correlation between consecutive samples and stabilizes training.
C. It increases the epsilon value.
D. It removes the need for a target network.
Correct Answer: It breaks the correlation between consecutive samples and stabilizes training.
Explanation:
Sequential samples in RL are highly correlated. Random sampling from a replay buffer approximates an i.i.d. (independent and identically distributed) data setting, which suits neural network training much better.
18. In the context of DQN, what is the 'Target Network'?
A. The network that selects the action.
B. The network that predicts the reward.
C. A copy of the main network with frozen weights used to calculate target Q-values.
D. The network used during the testing phase only.
Correct Answer: A copy of the main network with frozen weights used to calculate target Q-values.
Explanation:
The Target Network is a clone of the online (policy) network. Its weights are frozen and updated less frequently to provide a stable target for the loss function calculation.
19. Why is a Target Network necessary in DQN?
A. To handle continuous action spaces.
B. To speed up the backpropagation process.
C. To prevent the 'chasing your own tail' instability where target values shift constantly.
D. To increase the exploration rate.
Correct Answer: To prevent the 'chasing your own tail' instability where target values shift constantly.
Explanation:
Without a target network, each update modifies the same network used to calculate the target value, causing the target to shift immediately, which leads to oscillation and divergence.
20. How are weights usually updated in the Target Network?
A. Updated using a different loss function.
B. Randomly initialized every step.
C. Continuous backpropagation along with the main network.
D. Copied from the main network every fixed number of steps.
Correct Answer: Copied from the main network every fixed number of steps.
Explanation:
The weights of the main network are copied to the target network periodically (e.g., every 1000 steps) or via a soft update (Polyak averaging).
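Both update styles can be sketched with plain weight dictionaries standing in for real network parameters (the weight shapes and tau value are illustrative):

```python
import numpy as np

main = {"W": np.ones((4, 4))}                    # stand-in for network weights
target = {"W": np.zeros((4, 4))}
tau = 0.01                                       # soft-update rate (illustrative)

def hard_update(target, main):
    """Periodic full copy, e.g. every fixed number of steps."""
    for k in main:
        target[k] = main[k].copy()

def soft_update(target, main, tau):
    """Polyak averaging: the target drifts slowly toward the main network."""
    for k in main:
        target[k] = (1 - tau) * target[k] + tau * main[k]

soft_update(target, main, tau)                   # target["W"] moves from 0 to 0.01
```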
21. What is the loss function typically used in DQN?
A. Hinge Loss
B. Cross-Entropy Loss
C. Kullback-Leibler Divergence
D. Mean Squared Error (MSE) between predicted Q and target Q
Correct Answer: Mean Squared Error (MSE) between predicted Q and target Q
Explanation:
DQN treats the Bellman update as a regression problem, minimizing the squared difference between the current Q-value and the target Q-value (Reward + gamma * max Q_next).
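The target and loss can be sketched for a batch of transitions; the (1 - done) factor zeroes out the future term at terminal states, and the example numbers are made up:

```python
import numpy as np

gamma = 0.99

def dqn_targets(rewards, q_next, dones):
    """Target: r + gamma * max_a' Q_target(s', a'); the future term is 0 when done."""
    return rewards + gamma * (1.0 - dones) * q_next.max(axis=1)

def mse_loss(q_pred, targets):
    return np.mean((q_pred - targets) ** 2)

rewards = np.array([1.0, 0.0])
q_next = np.array([[0.5, 2.0], [3.0, 1.0]])      # Q_target values for next states
dones = np.array([1.0, 0.0])                     # first transition is terminal
targets = dqn_targets(rewards, q_next, dones)    # [1.0, 0.99 * 3.0]
loss = mse_loss(np.array([0.8, 2.5]), targets)
```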
22. What main issue does 'Double DQN' address?
A. Overestimation of Q-values
B. Underestimation of Q-values
C. Slow convergence speed
D. High memory usage
Correct Answer: Overestimation of Q-values
Explanation:
Standard DQN tends to overestimate Q-values because the max operator is used for both action selection and evaluation. Double DQN addresses this maximization bias.
23. In standard DQN, how is the target value calculated (ignoring reward and gamma)?
Correct Answer: max_a Q_target(s', a) — the maximum Q-value over next-state actions, taken from the target network.
Explanation:
Standard DQN uses the max Q-value of the next state directly from the target network, effectively selecting and evaluating the action using the same set of weights.
24. How does Double DQN calculate the target Q-value?
A. It uses the Main network to select the best action and the Target network to evaluate its value.
B. It uses the Target network to select the action and the Main network to evaluate it.
C. It uses two totally independent networks trained on different data.
D. It doubles the reward.
Correct Answer: It uses the Main network to select the best action and the Target network to evaluate its value.
Explanation:
Double DQN decouples selection and evaluation: a* = argmax_a Q_main(s', a), value = Q_target(s', a*).
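The decoupling can be sketched for a single transition (the Q-values below are made up to show that selection and evaluation can disagree):

```python
import numpy as np

gamma = 0.99

def double_dqn_target(reward, q_main_next, q_target_next):
    """Select with the main network, evaluate with the target network."""
    best_action = int(np.argmax(q_main_next))           # selection: argmax Q_main(s', .)
    return reward + gamma * q_target_next[best_action]  # evaluation: Q_target(s', a*)

q_main_next = np.array([1.0, 5.0, 2.0])          # main net picks action 1...
q_target_next = np.array([0.5, 3.0, 4.0])        # ...but its value comes from the target net
y = double_dqn_target(0.0, q_main_next, q_target_next)   # 0.99 * 3.0, not 0.99 * 4.0
```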
25. What is the architectural change in Dueling DQN compared to standard DQN?
A. It uses two separate neural networks for two different agents.
B. It removes the convolutional layers.
C. It uses Recurrent Neural Networks.
D. It splits the network into two streams: Value stream and Advantage stream.
Correct Answer: It splits the network into two streams: Value stream and Advantage stream.
Explanation:
Dueling DQN separates the estimator into two streams after the convolutional layers: one for the state Value function V(s) and one for the Advantage function A(s, a).
26. In Dueling DQN, what does the Value function V(s) estimate?
A. The total error.
B. How good a specific action is compared to others.
C. How good it is to be in a particular state, regardless of the action taken.
D. The immediate reward.
Correct Answer: How good it is to be in a particular state, regardless of the action taken.
Explanation:
V(s) represents the scalar value of the state itself, independent of the specific action chosen.
27. In Dueling DQN, what does the Advantage function A(s, a) estimate?
A. The importance of the state.
B. The value of the state.
C. How much better taking action 'a' is compared to the average action in state 's'.
D. The probability of winning.
Correct Answer: How much better taking action 'a' is compared to the average action in state 's'.
Explanation:
The Advantage function captures the relative importance of each action: Q(s, a) = V(s) + A(s, a).
28. How are the Value and Advantage streams combined in Dueling DQN to get Q-values?
A. Aggregation (Summation with normalization)
B. Concatenation
C. Convolution
D. Multiplication
Correct Answer: Aggregation (Summation with normalization)
Explanation:
They are aggregated via addition, usually subtracting the mean (or max) of the advantage to ensure identifiability: Q = V + (A - mean(A)).
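The aggregation step itself is a one-liner; the V(s) and A(s, a) values below are made up:

```python
import numpy as np

def dueling_q(value, advantages):
    """Q(s, a) = V(s) + (A(s, a) - mean_a A(s, a)); mean-subtraction for identifiability."""
    return value + (advantages - advantages.mean())

v = 2.0                                          # scalar state value V(s)
a = np.array([1.0, 0.0, -1.0])                   # per-action advantages A(s, a)
q = dueling_q(v, a)                              # mean(A) = 0 here, so Q = [3, 2, 1]
```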
29. What is the main benefit of Dueling DQN?
A. It works without rewards.
B. It eliminates the need for Experience Replay.
C. It is computationally cheaper than standard DQN.
D. It allows the agent to learn which states are valuable without having to learn the effect of every action.
Correct Answer: It allows the agent to learn which states are valuable without having to learn the effect of every action.
Explanation:
By decoupling V and A, the network can learn general state values more efficiently, which is useful in states where the choice of action doesn't affect the outcome much.
30. Which of the following is true about 'Off-policy' learning in Q-Learning?
A. It cannot use Experience Replay.
B. It requires a model of the environment.
C. The agent learns the value of the optimal policy regardless of the current actions taken.
D. The agent learns the value of the policy it is currently executing.
Correct Answer: The agent learns the value of the optimal policy regardless of the current actions taken.
Explanation:
Off-policy means the target policy (optimal, greedy) is different from the behavior policy (epsilon-greedy). Q-Learning approximates optimal Q-values even while exploring.
31. What represents the 'State' in a Reinforcement Learning framework?
A. The current situation or configuration of the environment.
B. The move made by the agent.
C. The feedback from the environment.
D. The decision maker.
Correct Answer: The current situation or configuration of the environment.
Explanation:
The State (S) is the observation or configuration of the environment at a specific time step.
32. In the Bellman optimality equation, what does 'max_a Q(s', a)' represent?
A. The average value of the next state.
B. The value of the best action available in the next state.
C. The immediate reward.
D. The worst possible future outcome.
Correct Answer: The value of the best action available in the next state.
Explanation:
This term represents the estimated maximum cumulative future reward achievable from the next state s'.
33. If an agent reaches a terminal state, what is the Target Q-value?
A. The immediate reward (r) only.
B. Zero.
C. Reward + gamma * max Q.
D. Infinity.
Correct Answer: The immediate reward (r) only.
Explanation:
In a terminal state, there is no next state, so the discounted future value is 0. The target is just the immediate reward.
34. Which of the following creates a 'Moving Target' problem in naive Deep Q-Learning?
A. Using the same network to calculate both predicted value and target value.
B. Using a small learning rate.
C. Using a fixed target network.
D. Using a replay buffer.
Correct Answer: Using the same network to calculate both predicted value and target value.
Explanation:
If the same network weights are used for prediction and target calculation, every update shifts the target, making convergence difficult (like a dog chasing its tail).
35. What is 'Catastrophic Forgetting' in the context of RL?
A. The agent forgets previously learned knowledge when training on new dissimilar experiences.
B. The gradients vanish.
C. The agent forgets the goal.
D. The replay buffer gets deleted.
Correct Answer: The agent forgets previously learned knowledge when training on new dissimilar experiences.
Explanation:
This happens when a neural network overwrites weights optimized for past experiences to fit new, correlated experiences. Experience replay helps mitigate this.
36. In Q-Learning, convergence to the optimal Q-values is guaranteed if:
A. The neural network is deep enough.
B. All state-action pairs are visited infinitely often and the learning rate decays appropriately.
C. Epsilon is kept at 1.0.
D. The discount factor is 1.
Correct Answer: All state-action pairs are visited infinitely often and the learning rate decays appropriately.
Explanation:
For tabular Q-Learning, convergence is proven if every state-action pair is visited infinitely often and the step sizes satisfy the Robbins-Monro decay conditions.
37. Which component of the tuple (S, A, R, S') is NOT known before the agent takes an action?
A. R and S' (Reward and Next State)
B. None of the above
C. A (Action chosen)
D. S (Current State)
Correct Answer: R and S' (Reward and Next State)
Explanation:
The agent knows the current state and chooses an action. The Reward and Next State are returned by the environment after the action is executed.
38. In Double DQN, the update equation replaces the target 'Y' with:
A. R + gamma * Q_target(s', argmax Q_main(s', a))
B. R + gamma * V(s')
C. R + gamma * max Q_target(s', a)
D. R + gamma * Q_main(s', argmax Q_target(s', a))
Correct Answer: R + gamma * Q_target(s', argmax Q_main(s', a))
Explanation:
This is the mathematical representation of Double DQN: select the action using the Main network (argmax Q_main), evaluate it using the Target network (Q_target).
39. What is the primary motivation for 'Prioritized Experience Replay'?
A. To replay recent experiences first.
B. To ensure random sampling.
C. To save memory.
D. To replay experiences where the agent had a high TD error (the most to learn from).
Correct Answer: To replay experiences where the agent had a high TD error (the most to learn from).
Explanation:
Prioritized Experience Replay samples transitions with high TD error more frequently, because these are the transitions where the agent's current predictions are most wrong (most surprising).
40. When preprocessing images for DQN (e.g., Atari), what is a common technique?
A. Inverting colors.
B. Increasing resolution to 4K.
C. Adding noise.
D. Converting to grayscale and resizing.
Correct Answer: Converting to grayscale and resizing.
Explanation:
To reduce computational complexity, images are typically converted to grayscale and downsampled (e.g., to 84x84) before being fed into the CNN.
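The idea can be sketched in plain NumPy with luminance weights and crude strided downsampling (real pipelines typically resize with an image library; the frame size here is chosen so striding happens to yield 84x84):

```python
import numpy as np

def preprocess(frame):
    """Grayscale via luminance weights, then 2x downsampling by striding."""
    gray = frame @ np.array([0.299, 0.587, 0.114])   # (H, W, 3) -> (H, W)
    small = gray[::2, ::2]                           # keep every 2nd row and column
    return small.astype(np.float32) / 255.0          # scale pixel values to [0, 1]

frame = np.zeros((168, 168, 3), dtype=np.uint8)      # dummy RGB frame
obs = preprocess(frame)                              # shape (84, 84)
```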
41. In Dueling DQN, the aggregation layer usually subtracts the mean of the Advantage values. Why?
A. To reduce the size of the output.
B. It is a requirement of the activation function.
C. To make the values positive.
D. For numerical stability and identifiability.
Correct Answer: For numerical stability and identifiability.
Explanation:
Subtracting the mean (or max) ensures the Q-value decomposition is unique (identifiability). Without this, V(s) and A(s, a) could shift by arbitrary constants without changing Q.
42. What is an 'Episode' in Reinforcement Learning?
A. One step of training.
B. The entire training process.
C. A sequence of states, actions, and rewards from start to a terminal state.
D. A single update of the Q-table.
Correct Answer: A sequence of states, actions, and rewards from start to a terminal state.
Explanation:
An episode is a complete run of the task, starting from an initial state and ending at a terminal state (e.g., game over or goal reached).
43. Which activation function is commonly used in the hidden layers of a Deep Q-Network?
A. Sigmoid
B. Step function
C. ReLU (Rectified Linear Unit)
D. Softmax
Correct Answer: ReLU (Rectified Linear Unit)
Explanation:
ReLU is standard for hidden layers in CNNs and DQNs due to efficient computation and mitigation of the vanishing gradient problem.
44. Why is the Softmax function generally NOT used in the output layer of a DQN?
A. It cannot handle negative numbers.
B. It is not differentiable.
C. DQN outputs Q-values (regression), not probabilities (classification).
D. It is too slow.
Correct Answer: DQN outputs Q-values (regression), not probabilities (classification).
Explanation:
DQN predicts expected future rewards (Q-values), which can be any real number. Softmax forces outputs to sum to 1, which applies to probabilities, not value estimation.
45. In the context of RL, what is 'Exploitation'?
A. Increasing the discount factor.
B. Selecting the action currently believed to be optimal.
C. Stopping the training early.
D. Trying new actions to gather information.
Correct Answer: Selecting the action currently believed to be optimal.
Explanation:
Exploitation means using the current knowledge (Q-values) to maximize reward, typically by choosing the action with the highest Q-value.
46. What is 'Frame Stacking' in DQN for Atari games?
A. Stacking Q-tables on top of each other.
B. Stacking rewards.
C. Stacking multiple neural networks.
D. Stacking consecutive frames to capture motion/velocity.
Correct Answer: Stacking consecutive frames to capture motion/velocity.
Explanation:
A single static frame doesn't show direction or speed. Stacking (usually 4) consecutive frames creates a state representation that includes temporal information.
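Frame stacking can be sketched with a bounded deque; repeating the first frame at episode start is one common convention, and the frame size is illustrative:

```python
import numpy as np
from collections import deque

stack = deque(maxlen=4)                          # DQN conventionally stacks 4 frames

def push_frame(frame):
    if not stack:                                # at episode start, repeat the first frame
        stack.extend([frame] * 4)
    else:
        stack.append(frame)                      # newest frame pushes out the oldest
    return np.stack(stack)                       # state shape: (4, H, W)

state = push_frame(np.zeros((84, 84)))
state = push_frame(np.ones((84, 84)))            # now [0, 0, 0, 1] along the stack axis
```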
47. What optimization algorithm is typically used to train the DQN weights?
A. Genetic Algorithms
B. Gradient Descent (e.g., RMSProp or Adam)
C. K-Means Clustering
D. Principal Component Analysis
Correct Answer: Gradient Descent (e.g., RMSProp or Adam)
Explanation:
Since DQN is a neural network minimizing a loss function (MSE), gradient descent optimizers like RMSProp or Adam are used.
48. If the Q-values for all actions in a state are equal, what will an epsilon-greedy policy (with epsilon = 0) do?
A. Increase epsilon.
B. Choose an action randomly among them (or the first one).
C. Choose no action.
D. Stop the episode.
Correct Answer: Choose an action randomly among them (or the first one).
Explanation:
If values are tied, the argmax function typically breaks ties arbitrarily (e.g., first index) or randomly among the tied best actions.
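Random tie-breaking has to be implemented explicitly, since NumPy's argmax always returns the first maximal index. A minimal sketch (the Q-values are made up):

```python
import random
import numpy as np

def argmax_random_tiebreak(q_values):
    """Break ties randomly among all actions sharing the maximum Q-value."""
    best = np.flatnonzero(q_values == q_values.max())   # indices of all tied maxima
    return int(random.choice(best))

q = np.array([0.5, 0.5, 0.1])
action = argmax_random_tiebreak(q)               # returns 0 or 1, never 2
```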
49. Which of the following implies that an RL problem is 'episodic'?
A. The environment is deterministic.
B. The agent runs forever.
C. The discount factor is 1.
D. The task breaks down into independent sequences ending in a terminal state.
Correct Answer: The task breaks down into independent sequences ending in a terminal state.
Explanation:
Episodic tasks have a clear start and end point (e.g., a game of Chess), as opposed to continuing tasks.
50. What is the primary reason DQN was considered a breakthrough (published by DeepMind)?
A. It was the first algorithm to master a wide range of Atari 2600 games using only raw pixels and scores.
B. It solved the traveling salesman problem.
C. It used a new type of CPU.
D. It proved that gamma should always be 0.99.
Correct Answer: It was the first algorithm to master a wide range of Atari 2600 games using only raw pixels and scores.
Explanation:
DQN (2013/2015) was a breakthrough because it demonstrated human-level performance across many different games using the same architecture and hyperparameters, learning directly from pixels.