1. What type of Reinforcement Learning algorithm is Q-Learning?
A.Model-based, On-policy
B.Model-free, On-policy
C.Model-based, Off-policy
D.Model-free, Off-policy
Correct Answer: Model-free, Off-policy
Explanation:Q-Learning is a model-free algorithm because it does not learn a model of the environment, and it is off-policy because it learns the value of the optimal policy independently of the agent's actions.
2. In Q-Learning, what does the 'Q' specifically represent?
A.Quantity
B.Quality
C.Query
D.Queue
Correct Answer: Quality
Explanation:The 'Q' stands for Quality, representing how useful a given action is in gaining some future reward.
3. What is the primary data structure used in basic tabular Q-Learning?
A.A Neural Network
B.A Q-Table
C.A Decision Tree
D.A Graph
Correct Answer: A Q-Table
Explanation:Tabular Q-Learning uses a Q-Table (lookup table) to store the Q-values for every state-action pair.
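For concreteness, a minimal sketch of such a Q-Table using NumPy; the state and action counts are made up for a small grid-world-style environment.

```python
import numpy as np

n_states, n_actions = 16, 4          # hypothetical sizes for a small grid world
Q = np.zeros((n_states, n_actions))  # one row per state, one column per action

state = 3
best_action = int(np.argmax(Q[state]))  # greedy lookup of the best known action in this state
print(Q[state], best_action)
```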
4. Which equation is used to update the Q-values in Q-Learning?
A.Maxwell's Equation
B.Bellman Equation
C.Schrodinger Equation
D.Euler's Equation
Correct Answer: Bellman Equation
Explanation:The Bellman Equation provides the recursive relationship used to update Q-values based on the reward and the value of the next state.
5. In the Q-learning update rule, what is the role of the learning rate (alpha)?
A.It determines the importance of future rewards.
B.It determines the probability of exploring a random action.
C.It controls how much the new information overrides the old information.
D.It calculates the total cumulative reward.
Correct Answer: It controls how much the new information overrides the old information.
Explanation:The learning rate (alpha) determines the extent to which newly acquired information overrides old information. Alpha=0 means nothing is learned; Alpha=1 means only the most recent information counts.
6. What is the purpose of the discount factor (gamma) in Q-Learning?
A.To balance immediate and future rewards
B.To set the learning speed
C.To determine the exploration rate
D.To initialize the Q-table
Correct Answer: To balance immediate and future rewards
Explanation:Gamma determines the present value of future rewards. A gamma of 0 makes the agent short-sighted (only cares about current reward), while a gamma close to 1 makes it far-sighted.
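Putting the Bellman update, the learning rate, and the discount factor together, here is a minimal sketch of one tabular update step (the transition values are invented for illustration):

```python
import numpy as np

Q = np.zeros((16, 4))
alpha, gamma = 0.1, 0.99            # learning rate and discount factor

# one hypothetical transition (s, a, r, s')
s, a, r, s_next = 3, 2, 1.0, 7

td_target = r + gamma * np.max(Q[s_next])   # Bellman target: reward + discounted best future value
td_error = td_target - Q[s, a]              # temporal-difference error
Q[s, a] += alpha * td_error                 # new information overrides old by a factor of alpha
```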
7. If the discount factor (gamma) is set to 0, what will the agent optimize for?
A.Long-term cumulative reward
B.Only the immediate reward
C.The average reward over time
D.Random rewards
Correct Answer: Only the immediate reward
Explanation:With gamma = 0, future rewards are multiplied by 0, so the agent only considers the immediate reward (r) received from the current action.
8. What is the 'Temporal Difference (TD) Error' in the context of Q-Learning?
A.The difference between the current Q-value and the previous Q-value
B.The difference between the target Q-value and the current predicted Q-value
C.The error in the reward function
D.The time it takes to converge
Correct Answer: The difference between the target Q-value and the current predicted Q-value
Explanation:The TD Error is the difference between the target value for the current state-action pair (Reward + discounted max future Q of the next state) and the currently stored Q-value.
9. What is the Epsilon-Greedy strategy used for?
A.To calculate the loss function
B.To balance exploration and exploitation
C.To update the weights of the network
D.To store experiences in replay memory
Correct Answer: To balance exploration and exploitation
Explanation:Epsilon-greedy is a policy that balances exploration (choosing random actions) and exploitation (choosing the best known action) based on probability epsilon.
10. In an Epsilon-Greedy strategy, what happens if epsilon is 1?
A.The agent always chooses the action with the highest Q-value.
B.The agent always chooses a random action.
C.The agent stops learning.
D.The agent alternates between best and random actions.
Correct Answer: The agent always chooses a random action.
Explanation:If epsilon is 1, the probability of exploring (choosing a random action) is 100%.
11. What is the typical behavior of epsilon during the training process in Deep Q-Learning?
A.It starts low and increases over time.
B.It remains constant throughout training.
C.It starts high and decays over time.
D.It fluctuates randomly.
Correct Answer: It starts high and decays over time.
Explanation:Epsilon usually starts high to encourage exploration of the environment early on and decays (decreases) over time to exploit the learned policy as it improves.
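A minimal sketch of epsilon-greedy selection with exponential decay; the decay rate and floor are arbitrary illustration values, not prescribed hyperparameters.

```python
import random
import numpy as np

def epsilon_greedy(q_values, epsilon):
    # explore with probability epsilon, otherwise exploit the best known action
    if random.random() < epsilon:
        return random.randrange(len(q_values))
    return int(np.argmax(q_values))

epsilon, eps_min, eps_decay = 1.0, 0.05, 0.995   # start high, decay toward a floor
for step in range(1000):
    action = epsilon_greedy(np.zeros(4), epsilon)
    epsilon = max(eps_min, epsilon * eps_decay)
```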
12. Why does tabular Q-Learning fail in environments like Atari games or Robotics?
A.The math does not apply to games.
B.The rewards are not defined.
C.The state space is too large (Curse of Dimensionality).
D.Q-learning cannot handle discrete actions.
Correct Answer: The state space is too large (Curse of Dimensionality).
Explanation:Tabular Q-learning requires a row for every possible state. In environments with high-dimensional inputs (like image pixels), the number of states is astronomically large, making a table impossible to store or fill.
13. What replaces the Q-Table in a Deep Q-Network (DQN)?
A.A larger Q-Table
B.A Deep Neural Network
C.A Genetic Algorithm
D.A Linear Regression model
Correct Answer: A Deep Neural Network
Explanation:In DQN, a neural network is used as a function approximator to estimate Q-values for states, rather than storing them in a table.
14. What is the input to the neural network in a standard DQN for playing video games?
A.The current score
B.The raw pixels of the game screen (state)
C.The action to be taken
D.The Q-value
Correct Answer: The raw pixels of the game screen (state)
Explanation:The network takes the state (usually preprocessed screen pixels) as input and outputs Q-values for all possible actions.
15. What is the output layer size of a DQN used for an environment with 'N' discrete actions?
A.1 (The best action)
B.1 (The value of the state)
C.N (One Q-value for each action)
D.N x N
Correct Answer: N (One Q-value for each action)
Explanation:A DQN takes a state as input and outputs a vector of size N, representing the predicted Q-value for each of the N possible actions.
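To illustrate only the input/output shapes, here is a toy fully connected "Q-network" forward pass in NumPy; a real Atari DQN uses convolutional layers, and the layer sizes below are arbitrary assumptions.

```python
import numpy as np

state_dim, n_actions = 8, 4                     # hypothetical sizes
rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(state_dim, 32)), np.zeros(32)
W2, b2 = rng.normal(size=(32, n_actions)), np.zeros(n_actions)

def q_network(state):
    h = np.maximum(state @ W1 + b1, 0.0)        # ReLU hidden layer
    return h @ W2 + b2                          # raw Q-values, one per action, no softmax

q_values = q_network(rng.normal(size=state_dim))
print(q_values.shape)                           # (4,) -> one Q-value for each of the N actions
```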
16. What is 'Experience Replay' in DQN?
A.Replaying the game after winning.
B.Storing past transitions (s, a, r, s') in a buffer and sampling minibatches for training.
C.Running the same episode multiple times.
D.Using the target network to replay actions.
Correct Answer: Storing past transitions (s, a, r, s') in a buffer and sampling minibatches for training.
Explanation:Experience Replay involves storing agent experiences in a buffer and randomly sampling them during training to break temporal correlations between consecutive samples.
17. What is the primary benefit of using Experience Replay?
A.It increases the epsilon value.
B.It breaks the correlation between consecutive samples and stabilizes training.
C.It removes the need for a target network.
D.It guarantees finding the global minimum.
Correct Answer: It breaks the correlation between consecutive samples and stabilizes training.
Explanation:Sequential samples in RL are highly correlated. Random sampling from a replay buffer creates an i.i.d. (independent and identically distributed) data setting, which is better for neural network training.
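A minimal replay buffer sketch using a deque and uniform random sampling; the capacity and batch size are illustrative choices.

```python
import random
from collections import deque

buffer = deque(maxlen=100_000)               # oldest transitions are discarded automatically

def store(s, a, r, s_next, done):
    buffer.append((s, a, r, s_next, done))   # store the transition tuple

def sample(batch_size=32):
    # uniform random sampling breaks the temporal correlation between consecutive steps
    return random.sample(buffer, batch_size)
```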
18. In the context of DQN, what is the 'Target Network'?
A.The network that selects the action.
B.A copy of the main network with frozen weights used to calculate target Q-values.
C.The network that predicts the reward.
D.The network used during the testing phase only.
Correct Answer: A copy of the main network with frozen weights used to calculate target Q-values.
Explanation:The Target Network is a clone of the online (policy) network. Its weights are frozen and updated less frequently to provide a stable target for the loss function calculation.
19. Why is a Target Network necessary in DQN?
A.To prevent the 'chasing your own tail' instability where target values shift constantly.
B.To speed up the backpropagation process.
C.To increase the exploration rate.
D.To handle continuous action spaces.
Correct Answer: To prevent the 'chasing your own tail' instability where target values shift constantly.
Explanation:Without a target network, the update modifies the same network used to calculate the target value, causing the target to shift immediately, which leads to oscillation and divergence.
20. How are weights usually updated in the Target Network?
A.Continuous backpropagation along with the main network.
B.Copied from the main network every fixed number of steps.
C.Randomly initialized every step.
D.Updated using a different loss function.
Correct Answer: Copied from the main network every fixed number of steps.
Explanation:The weights of the main network are copied to the target network periodically (e.g., every 1000 steps) or via a soft update (Polyak averaging).
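Two common ways to refresh the target network, sketched with plain NumPy weight arrays standing in for network parameters; the copy interval and tau are typical but arbitrary choices.

```python
import numpy as np

online_weights = [np.random.randn(8, 4)]        # stand-in for the main (online) network's parameters
target_weights = [w.copy() for w in online_weights]

def hard_update(step, interval=1000):
    # copy everything from the online network every fixed number of steps
    if step % interval == 0:
        for tw, ow in zip(target_weights, online_weights):
            tw[...] = ow

def soft_update(tau=0.005):
    # Polyak averaging: move the target a small fraction toward the online weights each step
    for tw, ow in zip(target_weights, online_weights):
        tw[...] = (1.0 - tau) * tw + tau * ow
```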
21. What is the loss function typically used in DQN?
A.Cross-Entropy Loss
B.Mean Squared Error (MSE) between predicted Q and target Q
C.Hinge Loss
D.Kullback-Leibler Divergence
Correct Answer: Mean Squared Error (MSE) between predicted Q and target Q
Explanation:DQN treats the Bellman update as a regression problem, minimizing the squared difference between the current Q-value and the target Q-value (Reward + gamma * max Q_next).
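The regression view in code: a sketch of the DQN loss for one sampled batch, assuming hypothetical `q_online` and `q_target` functions that return an array of Q-values per state (stand-ins for the two networks) and a 0/1 `dones` array.

```python
import numpy as np

def dqn_loss(batch, q_online, q_target, gamma=0.99):
    states, actions, rewards, next_states, dones = batch
    q_pred = q_online(states)[np.arange(len(actions)), actions]   # Q(s, a) from the main network
    q_next = q_target(next_states).max(axis=1)                    # max_a' Q_target(s', a')
    target = rewards + gamma * (1.0 - dones) * q_next             # no bootstrapping at terminal states
    return np.mean((target - q_pred) ** 2)                        # MSE between target and prediction
```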
22. What main issue does 'Double DQN' address?
A.Slow convergence speed
B.Overestimation of Q-values
C.Underestimation of Q-values
D.High memory usage
Correct Answer: Overestimation of Q-values
Explanation:Standard DQN tends to overestimate Q-values because the max operator is used for both action selection and evaluation. Double DQN addresses this maximization bias.
23. In standard DQN, how is the target value calculated (ignoring reward and gamma)?
Correct Answer: The maximum Q-value over all actions in the next state, taken from the target network (max_a Q_target(s', a)).
Explanation:Standard DQN uses the max Q-value of the next state directly from the target network, effectively selecting and evaluating the action using the same set of weights.
24. How does Double DQN calculate the target Q-value?
A.It uses two totally independent networks trained on different data.
B.It uses the Main network to select the best action and the Target network to evaluate its value.
C.It doubles the reward.
D.It uses the Target network to select the action and the Main network to evaluate it.
Correct Answer: It uses the Main network to select the best action and the Target network to evaluate its value.
Explanation:Double DQN decouples selection and evaluation: Action = argmax Q(Online Network), Value = Q(Target Network, Action).
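A sketch of the decoupled Double DQN target, again assuming hypothetical `q_online` / `q_target` functions that return per-action Q-value arrays for a batch of states.

```python
import numpy as np

def double_dqn_target(rewards, next_states, dones, q_online, q_target, gamma=0.99):
    best_actions = np.argmax(q_online(next_states), axis=1)                     # selection: main network
    q_eval = q_target(next_states)[np.arange(len(best_actions)), best_actions]  # evaluation: target network
    return rewards + gamma * (1.0 - dones) * q_eval
```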
25. What is the architectural change in Dueling DQN compared to standard DQN?
A.It uses two separate neural networks for two different agents.
B.It splits the network into two streams: Value stream and Advantage stream.
C.It removes the convolutional layers.
D.It uses Recurrent Neural Networks.
Correct Answer: It splits the network into two streams: Value stream and Advantage stream.
Explanation:Dueling DQN separates the estimator into two streams after the convolutional layers: one for the state Value function V(s) and one for the Advantage function A(s,a).
26. In Dueling DQN, what does the Value function V(s) estimate?
A.How good it is to be in a particular state, regardless of the action taken.
B.How good a specific action is compared to others.
C.The immediate reward.
D.The total error.
Correct Answer: How good it is to be in a particular state, regardless of the action taken.
Explanation:V(s) represents the scalar value of the state itself, independent of the specific action chosen.
27. In Dueling DQN, what does the Advantage function A(s, a) estimate?
A.The value of the state.
B.The importance of the state.
C.How much better taking action 'a' is compared to the average action in state 's'.
D.The probability of winning.
Correct Answer: How much better taking action 'a' is compared to the average action in state 's'.
Explanation:The Advantage function captures the relative importance of each action. Q(s,a) = V(s) + A(s,a).
28. How are the Value and Advantage streams combined in Dueling DQN to get Q-values?
A.Multiplication
B.Concatenation
C.Aggregation (Summation with normalization)
D.Convolution
Correct Answer: Aggregation (Summation with normalization)
Explanation:They are aggregated via addition, usually subtracting the mean (or max) of the advantage to ensure identifiability: Q = V + (A - mean(A)).
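The aggregation step in isolation, with made-up V and A values for a batch of two states:

```python
import numpy as np

V = np.array([[1.0], [0.5]])                 # state values, shape (batch, 1)
A = np.array([[0.2, -0.1, 0.3],
              [0.0,  0.4, -0.4]])            # advantages, shape (batch, n_actions)

Q = V + (A - A.mean(axis=1, keepdims=True))  # subtract the mean advantage for identifiability
print(Q)
```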
29. What is the main benefit of Dueling DQN?
A.It allows the agent to learn which states are valuable without having to learn the effect of every action.
B.It eliminates the need for Experience Replay.
C.It works without rewards.
D.It is computationally cheaper than standard DQN.
Correct Answer: It allows the agent to learn which states are valuable without having to learn the effect of every action.
Explanation:By decoupling V and A, the network can learn general state values more efficiently, which is useful in states where the choice of action doesn't affect the outcome much.
30. Which of the following is true about 'Off-policy' learning in Q-learning?
A.The agent learns the value of the policy it is currently executing.
B.The agent learns the value of the optimal policy regardless of the current actions taken.
C.It requires a model of the environment.
D.It cannot use Experience Replay.
Correct Answer: The agent learns the value of the optimal policy regardless of the current actions taken.
Explanation:Off-policy means the target policy (optimal, greedy) is different from the behavior policy (epsilon-greedy). Q-learning approximates optimal Q-values even while exploring.
31. What represents the 'State' in a Reinforcement Learning framework?
A.The feedback from the environment.
B.The decision maker.
C.The current situation or configuration of the environment.
D.The move made by the agent.
Correct Answer: The current situation or configuration of the environment.
Explanation:The State (S) is the observation or configuration of the environment at a specific time step.
32. In the Bellman optimality equation, what does 'max_a Q(s', a)' represent?
A.The worst possible future outcome.
B.The value of the best action available in the next state.
C.The average value of the next state.
D.The immediate reward.
Correct Answer: The value of the best action available in the next state.
Explanation:This term represents the estimated maximum cumulative future reward achievable from the next state s'.
33. If an agent reaches a terminal state, what is the Target Q-value?
A.The immediate reward (r) only.
B.Reward + gamma * max Q.
C.Zero.
D.Infinity.
Correct Answer: The immediate reward (r) only.
Explanation:In a terminal state, there is no next state, so the discounted future value is 0. The target is just the immediate reward.
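In code the terminal case is usually handled with a done flag, so the bootstrapped term is simply switched off; `max_q_next` here is a hypothetical value produced by the (target) network.

```python
def td_target(reward, max_q_next, done, gamma=0.99):
    # at a terminal state there is no next state, so the target is the reward alone
    return reward if done else reward + gamma * max_q_next
```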
34. Which of the following creates a 'Moving Target' problem in naive Deep Q-Learning?
A.Using a fixed target network.
B.Using the same network to calculate both predicted value and target value.
C.Using a replay buffer.
D.Using a small learning rate.
Correct Answer: Using the same network to calculate both predicted value and target value.
Explanation:If the same network weights are used for prediction and target calculation, every update shifts the target, making convergence difficult (like a dog chasing its tail).
35. What is 'Catastrophic Forgetting' in the context of RL?
A.The agent forgets the goal.
B.The agent forgets previously learned knowledge when training on new dissimilar experiences.
C.The replay buffer gets deleted.
D.The gradients vanish.
Correct Answer: The agent forgets previously learned knowledge when training on new dissimilar experiences.
Explanation:This happens when a neural network overwrites weights optimized for past experiences to fit new, correlated experiences. Experience replay helps mitigate this.
36. In Q-Learning, convergence to the optimal Q-values is guaranteed if:
A.The neural network is deep enough.
B.All state-action pairs are visited infinitely often and learning rate decays appropriately.
C.Epsilon is kept at 1.0.
D.The discount factor is 1.
Correct Answer: All state-action pairs are visited infinitely often and learning rate decays appropriately.
Explanation:For tabular Q-learning, convergence is proven if every state-action pair is visited infinitely often and the step sizes satisfy specific decay conditions (Robbins-Monro).
37. Which component of the tuple (S, A, R, S') is NOT known before the agent takes an action?
A.S (Current State)
B.A (Action chosen)
C.R and S' (Reward and Next State)
D.None of the above
Correct Answer: R and S' (Reward and Next State)
Explanation:The agent knows the current state and chooses an action. The Reward and Next State are returned by the environment after the action is executed.
38. In Double DQN, the update equation replaces the target 'Y' with:
A.R + gamma * Q_target(s', argmax Q_main(s', a))
B.R + gamma * max Q_target(s', a)
C.R + gamma * Q_main(s', argmax Q_target(s', a))
D.R + gamma * V(s')
Correct Answer: R + gamma * Q_target(s', argmax Q_main(s', a))
Explanation:This is the mathematical representation of Double DQN: select action using Main (argmax Q_main), evaluate using Target (Q_target).
39. What is the primary motivation for 'Prioritized Experience Replay'?
A.To replay recent experiences first.
B.To replay experiences where the agent had a high TD error (learned the most).
C.To save memory.
D.To ensure random sampling.
Correct Answer: To replay experiences where the agent had a high TD error (learned the most).
Explanation:Prioritized Experience Replay samples transitions with high TD error more frequently because these are the transitions where the agent's current predictions are most wrong (surprising).
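The core sampling idea, sketched with NumPy; the priority exponent 0.6 is a commonly cited value and the |TD error| values are invented for illustration.

```python
import numpy as np

td_errors = np.array([0.05, 2.0, 0.3, 0.9])          # |TD error| for each stored transition
priorities = (np.abs(td_errors) + 1e-6) ** 0.6       # small constant keeps every probability non-zero
probs = priorities / priorities.sum()

rng = np.random.default_rng(0)
batch_idx = rng.choice(len(td_errors), size=2, replace=False, p=probs)  # surprising transitions drawn more often
```

A full implementation also applies importance-sampling weights to correct the bias this non-uniform sampling introduces.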
40. When preprocessing images for DQN (e.g., Atari), what is a common technique?
A.Increasing resolution to 4K.
B.Converting to grayscale and resizing.
C.Adding noise.
D.Inverting colors.
Correct Answer: Converting to grayscale and resizing.
Explanation:To reduce computational complexity, images are typically converted to grayscale and downsampled (e.g., to 84x84) before being fed into the CNN.
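A dependency-free sketch of this preprocessing step; real pipelines typically use an image library for proper resizing, and the naive strided downsampling below is just for illustration.

```python
import numpy as np

def preprocess(frame):
    # frame: (210, 160, 3) RGB array, the native Atari resolution
    gray = frame @ np.array([0.299, 0.587, 0.114])    # luminance-weighted grayscale
    return gray[::2, ::2].astype(np.float32) / 255.0  # crude downsample and scale to [0, 1]

frame = np.random.randint(0, 256, size=(210, 160, 3))
print(preprocess(frame).shape)   # (105, 80)
```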
41. In Dueling DQN, the aggregation layer usually subtracts the mean of the Advantage values. Why?
A.To reduce the size of the output.
B.For numerical stability and identifiability.
C.To make the values positive.
D.It is a requirement of the activation function.
Correct Answer: For numerical stability and identifiability.
Explanation:Subtracting the mean (or max) ensures the Q-value is unique (Identifiability). Without this, V(s) and A(s,a) could shift by arbitrary constants without changing Q.
42. What is an 'Episode' in Reinforcement Learning?
A.One step of training.
B.A sequence of states, actions, and rewards from start to a terminal state.
C.The entire training process.
D.A single update of the Q-table.
Correct Answer: A sequence of states, actions, and rewards from start to a terminal state.
Explanation:An episode is a complete run of the task, starting from an initial state and ending at a terminal state (e.g., game over or goal reached).
43. Which activation function is commonly used in the hidden layers of a Deep Q-Network?
A.Sigmoid
B.ReLU (Rectified Linear Unit)
C.Softmax
D.Step function
Correct Answer: ReLU (Rectified Linear Unit)
Explanation:ReLU is standard for hidden layers in CNNs and DQNs due to efficient computation and mitigation of the vanishing gradient problem.
44. Why is the Softmax function generally NOT used in the output layer of a DQN?
A.It is too slow.
B.DQN outputs Q-values (regression), not probabilities (classification).
C.It cannot handle negative numbers.
D.It is not differentiable.
Correct Answer: DQN outputs Q-values (regression), not probabilities (classification).
Explanation:DQN predicts expected future rewards (Q-values), which can be any real number. Softmax forces outputs to sum to 1, which applies to probabilities, not value estimation.
45. In the context of RL, what is 'Exploitation'?
A.Trying new actions to gather information.
B.Selecting the action currently believed to be optimal.
C.Stopping the training early.
D.Increasing the discount factor.
Correct Answer: Selecting the action currently believed to be optimal.
Explanation:Exploitation means using the current knowledge (Q-values) to maximize reward, typically by choosing the action with the highest Q-value.
46. What is 'Frame Stacking' in DQN for Atari games?
A.Stacking Q-tables on top of each other.
B.Stacking consecutive frames to capture motion/velocity.
C.Stacking multiple neural networks.
D.Stacking rewards.
Correct Answer: Stacking consecutive frames to capture motion/velocity.
Explanation:A single static frame doesn't show direction or speed. Stacking usually 4 consecutive frames creates a state representation that includes temporal information.
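Frame stacking is often implemented with a fixed-length deque; stacking 4 frames follows the original DQN setup, while the 84x84 frame size here is just an illustrative assumption.

```python
import numpy as np
from collections import deque

frames = deque(maxlen=4)                         # keep only the 4 most recent frames

def push_and_stack(frame):
    frames.append(frame)
    while len(frames) < 4:                       # pad at episode start by repeating the first frame
        frames.append(frame)
    return np.stack(frames, axis=0)              # shape (4, H, W): the state the network sees

state = push_and_stack(np.zeros((84, 84)))
print(state.shape)                               # (4, 84, 84)
```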
47. What optimization algorithm is typically used to train the DQN weights?
A.K-Means Clustering
B.Gradient Descent (e.g., RMSProp or Adam)
C.Genetic Algorithms
D.Principal Component Analysis
Correct Answer: Gradient Descent (e.g., RMSProp or Adam)
Explanation:Since DQN is a neural network minimizing a loss function (MSE), gradient descent optimizers like RMSProp or Adam are used.
48. If the Q-values for all actions in a state are equal, what will an epsilon-greedy policy (with epsilon=0) do?
A.Choose an action randomly among them (or the first one).
B.Stop the episode.
C.Choose no action.
D.Increase epsilon.
Correct Answer: Choose an action randomly among them (or the first one).
Explanation:If values are tied, the argmax function typically breaks ties arbitrarily (e.g., first index) or randomly among the tied best actions.
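np.argmax always returns the first index on ties; if random tie-breaking is preferred, it can be made explicit like this:

```python
import numpy as np

def greedy_with_random_ties(q_values, rng=np.random.default_rng()):
    best = np.flatnonzero(q_values == q_values.max())   # indices of all tied best actions
    return int(rng.choice(best))                         # pick one of them at random

print(greedy_with_random_ties(np.array([0.5, 0.5, 0.1])))  # 0 or 1, chosen at random
```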
49. Which of the following implies that an RL problem is 'episodic'?
A.The agent runs forever.
B.The task breaks down into independent sequences ending in a terminal state.
C.The discount factor is 1.
D.The environment is deterministic.
Correct Answer: The task breaks down into independent sequences ending in a terminal state.
Explanation:Episodic tasks have a clear start and end point (e.g., a game of Chess), as opposed to continuous tasks.
50. What is the primary reason DQN was considered a breakthrough (published by DeepMind)?
A.It solved the traveling salesman problem.
B.It was the first algorithm to master a wide range of Atari 2600 games using only raw pixels and scores.
C.It proved that gamma should always be 0.99.
D.It used a new type of CPU.
Correct Answer: It was the first algorithm to master a wide range of Atari 2600 games using only raw pixels and scores.
Explanation:DQN (2013/2015) was a breakthrough because it demonstrated human-level performance across many different games using the same architecture and hyperparameters, learning directly from pixels.