1. What is the primary goal of an optimization algorithm in the context of training a machine learning model?
Optimization problem formulation in ML
Easy
A.To make the model as complex as possible
B.To minimize the loss function
C.To maximize the number of features
D.To increase the size of the dataset
Correct Answer: To minimize the loss function
Explanation:
In machine learning, optimization is the process of finding the model parameters that minimize the loss function, which measures the error between the model's predictions and the actual true values.
2. In machine learning, a 'loss function' or 'cost function' quantifies:
Optimization problem formulation in ML
Easy
A.The model's overall accuracy on the test set
B.The computational resources used for training
C.The number of training iterations (epochs)
D.The error or 'cost' of the model's predictions
Correct Answer: The error or 'cost' of the model's predictions
Explanation:
A loss function is a mathematical function that measures how far a model's prediction is from the target value. The goal of training is to adjust the model's parameters to make this value as small as possible.
3. What is a key property of a convex function that makes it desirable for optimization?
Convex sets and convex functions
Easy
A.It cannot be minimized
B.Any local minimum is also a global minimum
C.It has multiple local minima
D.It is always a linear function
Correct Answer: Any local minimum is also a global minimum
Explanation:
For a convex function, any point that is a local minimum is guaranteed to be the single global minimum. This simplifies optimization, as an algorithm won't get stuck in a suboptimal local minimum.
4. A set of points is defined as a 'convex set' if:
Convex sets and convex functions
Easy
A.It has a finite number of points
B.It is shaped like a perfect circle or sphere
C.It must contain the origin (0,0)
D.A line segment connecting any two points within the set lies entirely within the set
Correct Answer: A line segment connecting any two points within the set lies entirely within the set
Explanation:
This is the fundamental definition of a convex set. If you can pick any two points in the set and the straight line between them stays inside the set, the set is convex.
5. The gradient of a function f at a specific point, denoted ∇f, points in the direction of:
Gradients and directional derivatives
Easy
A.The steepest descent
B.The function's origin
C.Zero change (a flat area)
D.The steepest ascent
Correct Answer: The steepest ascent
Explanation:
The gradient is a vector that points in the direction where the function's value increases most rapidly. Its magnitude indicates the rate of this increase.
6. In gradient-based optimization, why do we move in the direction of the negative gradient?
Gradients and directional derivatives
Easy
A.To move towards a maximum of the function
B.Because the gradient is always a negative value
C.To move towards a minimum of the function
D.To increase the learning rate
Correct Answer: To move towards a minimum of the function
Explanation:
Since the gradient points in the direction of steepest increase, the negative gradient (−∇f) points in the direction of steepest decrease. Following this direction helps us find a minimum of the loss function.
7. If the gradient of a loss function at a certain point is zero, what does this indicate?
Gradients and directional derivatives
Easy
A.The model has perfectly fit the data
B.The algorithm has encountered an error
C.The point is a critical point (minimum, maximum, or saddle point)
D.The learning rate is too high
Correct Answer: The point is a critical point (minimum, maximum, or saddle point)
Explanation:
A zero gradient signifies a 'flat' spot in the loss landscape. This occurs at local minima, maxima, and saddle points, where a small step in any direction does not change the function's value.
8. What is the role of the 'learning rate' in the Gradient Descent algorithm?
Gradient descent and variants (batch, stochastic, mini-batch)
Easy
A.It determines the step size taken during each iteration
B.It measures the accuracy of the model
C.It defines the size of the mini-batch
D.It specifies the total number of iterations
Correct Answer: It determines the step size taken during each iteration
Explanation:
The learning rate is a hyperparameter that scales the gradient. It controls how large of a step the algorithm takes in the direction of the negative gradient to update the model's weights.
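The update rule described in the explanation above can be sketched in a few lines of Python. This is a toy example: the function f(w) = (w − 3)², the starting point, and the learning rate lr = 0.1 are illustrative choices, not recommendations.

```python
# Toy gradient descent on f(w) = (w - 3)^2.
def grad(w):
    return 2.0 * (w - 3.0)   # derivative of (w - 3)^2

w = 0.0
lr = 0.1                     # learning rate: scales the step along the negative gradient
for _ in range(100):
    w -= lr * grad(w)        # move against the gradient

# w ends up very close to the minimizer w* = 3
```

A larger lr takes bigger steps per iteration but risks overshooting the minimum; a smaller lr is safer but slower.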
9. Which variant of Gradient Descent calculates the gradient using the entire training dataset for a single parameter update?
Gradient descent and variants (batch, stochastic, mini-batch)
Easy
A.Stochastic Gradient Descent (SGD)
B.Adam
C.Batch Gradient Descent
D.Mini-batch Gradient Descent
Correct Answer: Batch Gradient Descent
Explanation:
Batch Gradient Descent processes all training examples for each single update. While it provides a very accurate gradient, it can be extremely slow and computationally expensive for large datasets.
10. Stochastic Gradient Descent (SGD) updates the model's parameters using:
Gradient descent and variants (batch, stochastic, mini-batch)
Easy
A.Only the validation dataset
B.The entire training dataset
C.A small batch of training examples
D.A single training example at a time
Correct Answer: A single training example at a time
Explanation:
Stochastic Gradient Descent performs one update for each training example. This makes each update much faster but also noisier, leading to a less stable convergence path.
11. What is the primary advantage of Mini-batch Gradient Descent over Batch Gradient Descent?
Gradient descent and variants (batch, stochastic, mini-batch)
Easy
A.It does not require setting a learning rate
B.It is computationally more efficient and faster for large datasets
C.It always finds a better minimum
D.It guarantees convergence in fewer iterations
Correct Answer: It is computationally more efficient and faster for large datasets
Explanation:
By using a small batch of data instead of the entire dataset, mini-batch gradient descent provides a good balance between the accuracy of the batch gradient and the speed of the stochastic gradient.
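The batch/update structure described above can be sketched as follows. This is a toy example (estimating the mean of a small dataset by minimizing squared error); the batch size, learning rate, and epoch count are illustrative choices.

```python
import random

# Toy mini-batch gradient descent: estimate the mean of `data` by
# minimizing L(w) = mean((w - x)^2) over each mini-batch.
data = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
w, lr, batch_size = 0.0, 0.1, 4

random.seed(0)
for epoch in range(200):
    random.shuffle(data)                       # new batch composition each epoch
    for i in range(0, len(data), batch_size):
        batch = data[i:i + batch_size]
        # batch-averaged gradient of (w - x)^2
        g = sum(2.0 * (w - x) for x in batch) / len(batch)
        w -= lr * g                            # one update per mini-batch

# w hovers near the full-data mean, 4.5, with small batch-to-batch noise
```

Each epoch performs len(data)/batch_size updates instead of one (as in batch GD) or len(data) (as in SGD), which is where the compromise comes from.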
12. What is the core idea behind the Momentum optimization algorithm?
Momentum-based optimization
Easy
A.To accelerate movement in the relevant direction and dampen oscillations
B.To use a different learning rate for every parameter
C.To only use data points that have high error
D.To randomly change direction to escape local minima
Correct Answer: To accelerate movement in the relevant direction and dampen oscillations
Explanation:
Momentum adds a fraction of the previous update vector to the current one. This helps the optimizer build up speed in directions with a consistent gradient and reduces oscillations in other directions, much like a ball rolling down a hill.
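The "fraction of the previous update vector" idea can be sketched as a two-line update rule. The hyperparameters lr = 0.1 and beta = 0.9 and the toy objective f(w) = w² are illustrative assumptions.

```python
# Classical momentum update (sketch).
def momentum_step(w, v, grad, lr=0.1, beta=0.9):
    v = beta * v + grad   # accumulate velocity from past gradients
    w = w - lr * v        # step using the velocity, not the raw gradient
    return w, v

# On f(w) = w^2 (gradient 2w), momentum carries the iterate toward the minimum at 0.
w, v = 5.0, 0.0
for _ in range(300):
    w, v = momentum_step(w, v, 2.0 * w)
```

With beta = 0, this reduces to plain gradient descent; larger beta gives more weight to the accumulated direction.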
13. How does momentum help when the loss surface has long, narrow ravines (valleys)?
Momentum-based optimization
Easy
A.It increases the learning rate to flatten the ravine
B.It forces the updates to jump out of the ravine
C.It stops the optimization process
D.It helps accelerate progress along the bottom of the ravine
Correct Answer: It helps accelerate progress along the bottom of the ravine
Explanation:
In narrow ravines, the gradient often oscillates across the steep sides. Momentum dampens these side-to-side oscillations while accumulating the gradient along the length of the ravine, leading to faster convergence.
14. The Adam optimizer is an adaptive learning rate method that combines the key ideas of which two other optimizers?
RMSProp and Adam
Easy
A.Stochastic Gradient Descent and Batch Gradient Descent
B.L-BFGS and Momentum
C.Momentum and RMSProp
D.Adagrad and Newton's Method
Correct Answer: Momentum and RMSProp
Explanation:
Adam (Adaptive Moment Estimation) combines the 'momentum' concept of using a moving average of the gradient with the adaptive learning rate feature of RMSProp, which uses a moving average of the squared gradients.
15. What does an 'adaptive learning rate' mean in the context of optimizers like RMSProp and Adam?
RMSProp and Adam
Easy
A.The optimizer maintains a separate learning rate for each model parameter
B.The learning rate is randomly chosen at each step
C.The user must manually adapt the learning rate during training
D.The learning rate steadily increases over time
Correct Answer: The optimizer maintains a separate learning rate for each model parameter
Explanation:
Adaptive learning rate methods adjust the learning rate for each parameter individually. Parameters that receive large gradients will have their effective learning rate reduced, while parameters with small gradients will have theirs increased.
16. What problem seen in other algorithms like Adagrad does RMSProp help to solve?
RMSProp and Adam
Easy
A.The updates being too noisy
B.The use of too much memory
C.The learning rate becoming aggressively small and nearly stopping learning
D.The learning rate becoming too large and causing divergence
Correct Answer: The learning rate becoming aggressively small and nearly stopping learning
Explanation:
In Adagrad, the learning rate can become infinitely small over time because it accumulates all past squared gradients. RMSProp fixes this by using an exponentially decaying average, preventing the learning rate from vanishing too quickly.
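The difference between the two accumulators can be seen directly in a sketch. The decay rate 0.9 and the constant squared gradient g2 = 1.0 are illustrative assumptions.

```python
# Contrast the denominators of Adagrad and RMSProp over 1000 steps,
# assuming a constant squared gradient g2 = 1.0 at every step.
g2 = 1.0
adagrad_acc, rmsprop_acc = 0.0, 0.0
for t in range(1000):
    adagrad_acc += g2                            # sums all history: grows without bound
    rmsprop_acc = 0.9 * rmsprop_acc + 0.1 * g2   # decaying average: stays bounded

# Adagrad's accumulator reaches 1000 here, so its effective step
# lr / sqrt(acc) keeps shrinking; RMSProp's settles near the recent mean, ~1.0.
```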
17. When training a deep learning model on a massive dataset, which gradient descent variant is most often preferred for practical reasons?
Optimization considerations in large-scale ML systems
Easy
A.Mini-batch Gradient Descent
B.Grid Search
C.Batch Gradient Descent
D.Newton's Method
Correct Answer: Mini-batch Gradient Descent
Explanation:
Batch Gradient Descent is computationally infeasible for very large datasets as it requires loading all data for one update. Mini-batch GD provides a compromise, offering efficient computation and more stable convergence than pure SGD.
18. A common hardware-related challenge in large-scale machine learning is:
Optimization considerations in large-scale ML systems
Easy
A.The computer case not having enough fans
B.Running out of hard drive space to store Python scripts
C.The keyboard wearing out from too much coding
D.Fitting the model and a batch of data into GPU memory (VRAM)
Correct Answer: Fitting the model and a batch of data into GPU memory (VRAM)
Explanation:
Modern deep learning models can have billions of parameters, and training data can be large. A major constraint is the limited amount of high-speed memory (VRAM) on GPUs, which must hold the model, the data batch, and the gradients.
19. Why are convex optimization problems generally easier to solve than non-convex ones?
Convex sets and convex functions
Easy
A.They require less data to solve
B.They can only be used for linear models
C.They always converge in a single step
D.They do not have local minima that could trap the optimization algorithm
Correct Answer: They do not have local minima that could trap the optimization algorithm
Explanation:
Non-convex problems can have many local minima, and an algorithm like gradient descent might get stuck in one that is not the best overall solution (the global minimum). Convex problems have only one minimum, which is the global minimum.
20. The path of Stochastic Gradient Descent (SGD) towards the minimum is often described as 'noisy' or 'zig-zagging'. Why is this?
Gradient descent and variants (batch, stochastic, mini-batch)
Easy
A.Because the gradient is estimated based on only one training sample at a time
B.Because the algorithm adds random noise on purpose
C.Because the learning rate is constantly increasing
D.Because it uses the entire dataset for each step
Correct Answer: Because the gradient is estimated based on only one training sample at a time
Explanation:
The gradient from a single sample can be a poor estimate of the true gradient over the whole dataset. This causes the updates to fluctuate and follow a noisy path, though the general direction is still towards the minimum.
21. In a logistic regression model, the goal is to find parameters that maximize the likelihood of the training data. How is this typically formulated as a minimization problem for optimization algorithms like gradient descent?
Optimization problem formulation in ML
Medium
A.By minimizing the likelihood directly
B.By maximizing the L2 norm of the parameters
C.By minimizing the negative log-likelihood
D.By minimizing the number of misclassified points
Correct Answer: By minimizing the negative log-likelihood
Explanation:
Optimization algorithms are usually designed to minimize a function. Maximizing the likelihood is equivalent to maximizing its logarithm (since log is a monotonic function), which in turn is equivalent to minimizing the negative log-likelihood, −log L(θ). This is the standard loss function for logistic regression.
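The negative log-likelihood loss from the explanation above can be sketched for binary labels. The helper name and example probabilities are illustrative.

```python
import math

# Negative log-likelihood for binary classification: for labels y in {0, 1}
# and predicted probabilities p, NLL = -sum(log p_i if y_i == 1 else log(1 - p_i)).
# Minimizing this is equivalent to maximizing the likelihood.
def negative_log_likelihood(probs, labels):
    return -sum(math.log(p) if y == 1 else math.log(1.0 - p)
                for p, y in zip(probs, labels))

# Confident correct predictions give a small loss; confident wrong ones a large loss.
good = negative_log_likelihood([0.9, 0.1], [1, 0])  # both predictions correct
bad = negative_log_likelihood([0.1, 0.9], [1, 0])   # both predictions wrong
```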
22. If a function f is convex and differentiable, and we find a point x* where the gradient ∇f(x*) = 0, what can we conclude about x*?
Convex sets and convex functions
Medium
A.x* is a global minimum.
B.x* could be a local maximum.
C.x* is a saddle point.
D.x* is a local minimum, but not necessarily global.
Correct Answer: x* is a global minimum.
Explanation:
For a convex function, any point where the gradient is zero is a global minimum. This is a key property that makes optimizing convex functions much more reliable than non-convex ones, as there are no local minima that aren't also global.
23. What is the primary trade-off when switching from Batch Gradient Descent (BGD) to Stochastic Gradient Descent (SGD) for a large dataset?
Gradient descent and variants (batch, stochastic, mini-batch)
Medium
A.Trading faster, more frequent updates for a noisier, less direct convergence path.
B.Trading the need for a learning rate for automatic step size adjustment.
C.Trading a smooth convergence path for a higher computational cost per epoch.
D.Trading lower memory usage for a guaranteed convergence to the global minimum.
Correct Answer: Trading faster, more frequent updates for a noisier, less direct convergence path.
Explanation:
SGD updates the model parameters using only one data point at a time, making each update much faster but also much noisier than BGD, which uses the entire dataset. This noise can help escape shallow local minima but results in a more erratic convergence path.
24. In the context of Gradient Descent, what is the main role of the momentum term?
Momentum-based optimization
Medium
A.It guarantees that the algorithm will find the global minimum in non-convex functions.
B.It helps accelerate convergence in relevant directions and dampens oscillations in others.
C.It adapts the learning rate for each parameter individually based on past gradients.
D.It normalizes the gradient vector to have a unit length.
Correct Answer: It helps accelerate convergence in relevant directions and dampens oscillations in others.
Explanation:
The momentum term accumulates a velocity vector in directions of persistent gradient. This helps the optimizer move faster along shallow ravines (consistent gradient direction) and dampens oscillations across steep directions (where the gradient sign flips), leading to faster and more stable convergence.
25. The RMSProp optimization algorithm was primarily designed to solve which specific problem encountered in the Adagrad algorithm?
RMSProp and Adam
Medium
A.The tendency to get stuck in saddle points.
B.The aggressive and monotonically decreasing learning rate.
C.The inability to handle sparse data effectively.
D.The high computational cost of calculating second-order derivatives.
Correct Answer: The aggressive and monotonically decreasing learning rate.
Explanation:
Adagrad adapts the learning rate for each parameter but accumulates all past squared gradients in the denominator. This causes the learning rate to shrink monotonically and eventually become too small, effectively stopping learning. RMSProp fixes this by using an exponentially decaying average of squared gradients, preventing the learning rate from vanishing.
26. Which of the following functions is non-convex?
Convex sets and convex functions
Medium
A.sin(x) on the interval [0, 2π]
B.||x||₂ (the L2 norm)
C.
D.
Correct Answer: sin(x) on the interval [0, 2π]
Explanation:
A function is convex if the line segment between any two points on its graph lies on or above the graph. The sine function oscillates, and a line segment connecting two points on its graph (e.g., at x = 0 and x = 2π) will pass below the graph, violating the definition of convexity. The other functions are classic examples of convex functions.
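A quick numeric way to expose non-convexity is the midpoint test used implicitly in the explanation above. The chosen endpoints are illustrative.

```python
import math

# Midpoint convexity check: a convex f satisfies
# f((a + b) / 2) <= (f(a) + f(b)) / 2 for every pair a, b.
# Finding any pair that violates this proves f is not convex.
def violates_midpoint_convexity(f, a, b):
    return f((a + b) / 2) > (f(a) + f(b)) / 2

# sin on [0, 2*pi]: the chord from (0, 0) to (2*pi, 0) lies below the arch at pi/2.
sine_violates = violates_midpoint_convexity(math.sin, 0.0, math.pi)
# x^2 is convex, so no pair of points can violate the inequality.
square_violates = violates_midpoint_convexity(lambda x: x * x, -3.0, 5.0)
```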
27. For a function , what is the directional derivative at the point in the direction of the vector ?
Gradients and directional derivatives
Medium
A.18/5
B.14/5
C.18
D.14
Correct Answer: 18/5
Explanation:
The directional derivative of f at a point p in the direction of a vector v is D_v f(p) = ∇f(p) · v/‖v‖: evaluate the gradient at p, then take its dot product with the unit vector along v. Applying this to the given function, point, and direction yields 18/5.
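The gradient-dot-unit-vector recipe can be sketched as follows. Since the question's original function was not preserved, the function f(x, y) = x·y² below is a hypothetical stand-in, and the point and direction are illustrative.

```python
import math

# Directional derivative: D_v f(p) = grad f(p) . v / ||v||.
def directional_derivative(grad_at_p, v):
    norm = math.sqrt(sum(c * c for c in v))
    return sum(g * c for g, c in zip(grad_at_p, v)) / norm

# Stand-in example: f(x, y) = x*y^2, so grad f = (y^2, 2*x*y).
# At p = (1, 2), grad f(p) = (4, 4); direction v = (3, 4) has norm 5.
d = directional_derivative((4.0, 4.0), (3.0, 4.0))  # (12 + 16) / 5 = 5.6
```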
28. Which statement best describes the advantage of Mini-batch Gradient Descent over both Stochastic Gradient Descent (SGD) and Batch Gradient Descent (BGD)?
Gradient descent and variants (batch, stochastic, mini-batch)
Medium
A.It balances the stability of BGD with the efficiency of SGD by using a small subset of data for each update.
B.It uses the full dataset for each update, but in a vectorized and more efficient manner than BGD.
C.It converges to the global minimum faster than both SGD and BGD on all types of problems.
D.It does not require tuning a learning rate, unlike SGD and BGD.
Correct Answer: It balances the stability of BGD with the efficiency of SGD by using a small subset of data for each update.
Explanation:
Mini-batch GD offers a compromise. By using a small batch of data (e.g., 32-512 samples), it reduces the variance of the updates compared to SGD (making convergence more stable) and is computationally more efficient than BGD on large datasets. It also allows for hardware-level vectorization benefits.
29. The Adam optimizer can be viewed as a combination of which two other optimization algorithms?
RMSProp and Adam
Medium
A.Momentum and RMSProp
B.Newton's method and Batch Gradient Descent
C.L-BFGS and RMSProp
D.Adagrad and Nesterov Momentum
Correct Answer: Momentum and RMSProp
Explanation:
Adam (Adaptive Moment Estimation) combines the ideas of Momentum and RMSProp. It uses an exponentially decaying average of past gradients (like Momentum) to estimate the first moment (velocity), and an exponentially decaying average of past squared gradients (like RMSProp) to estimate the second moment (adaptive learning rate).
30. In a large-scale distributed training system using data parallelism, what is the typical procedure for a single training step?
Optimization considerations in large-scale ML systems
Medium
A.The dataset is split, and each worker trains a completely independent model on its subset, which are later averaged.
B.The model itself is split across multiple workers, and each worker is responsible for computing a part of the forward and backward pass.
C.The model is replicated on multiple workers, each processes a different batch of data, and their resulting gradients are aggregated to update a central model.
D.A central server sends one data point at a time to each worker, waits for the gradient, and updates the model before sending the next point.
Correct Answer: The model is replicated on multiple workers, each processes a different batch of data, and their resulting gradients are aggregated to update a central model.
Explanation:
This describes synchronous data parallelism. The model is copied to each worker. Each worker computes gradients on its own mini-batch of data. These gradients are then aggregated (e.g., averaged) to compute a single update, which is applied to the central model. The updated model is then re-distributed to the workers.
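The aggregate-then-update cycle of synchronous data parallelism can be sketched in-process, with each "worker" simulated as a function call on its own data shard. The loss, shard contents, and learning rate are illustrative assumptions; a real system would run the workers on separate devices and aggregate with a collective operation such as all-reduce.

```python
# Simulated synchronous data parallelism: one model replica per worker,
# gradients averaged into a single update of the shared parameter w.
def worker_gradient(w, batch):
    # gradient of the mean squared error (w - x)^2 over the worker's batch
    return sum(2.0 * (w - x) for x in batch) / len(batch)

w, lr = 0.0, 0.5
shards = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]    # one mini-batch per worker
for _ in range(50):
    grads = [worker_gradient(w, shard) for shard in shards]  # parallel in practice
    avg_grad = sum(grads) / len(grads)           # aggregation (e.g., all-reduce)
    w -= lr * avg_grad                           # single update to the central model
```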
31. At a given point on the surface of a loss function L, the negative gradient vector, −∇L, points in which direction?
Gradients and directional derivatives
Medium
A.The direction of steepest descent.
B.A direction tangent to the contour line.
C.The direction of steepest ascent.
D.The direction directly towards the global minimum.
Correct Answer: The direction of steepest descent.
Explanation:
By definition, the gradient vector ∇L points in the direction of the steepest ascent of the function L. Therefore, the negative gradient, −∇L, points in the direction of the steepest descent, which is precisely the direction gradient descent algorithms follow to minimize the function.
32. In the context of a machine learning optimization problem, what is the primary purpose of adding a regularization term (e.g., L1 or L2) to the loss function?
Optimization problem formulation in ML
Medium
A.To speed up the convergence of gradient descent.
B.To ensure the loss function is always convex.
C.To penalize model complexity and prevent overfitting.
D.To increase the model's accuracy on the training data.
Correct Answer: To penalize model complexity and prevent overfitting.
Explanation:
Regularization terms add a penalty based on the magnitude of the model's parameters to the loss function. This discourages the model from learning overly complex patterns that fit the training data's noise, thus improving its ability to generalize to new, unseen data (i.e., preventing overfitting).
33. How does Nesterov Accelerated Gradient (NAG) differ fundamentally from standard Momentum?
Momentum-based optimization
Medium
A.NAG uses a fixed momentum coefficient of 0.9, while standard momentum requires tuning.
B.NAG is a second-order optimization method, whereas standard momentum is a first-order method.
C.NAG calculates the gradient at a 'lookahead' position, after applying the current velocity, rather than at the current position.
D.NAG uses a decaying average of squared gradients instead of a simple velocity term.
Correct Answer: NAG calculates the gradient at a 'lookahead' position, after applying the current velocity, rather than at the current position.
Explanation:
Standard momentum first calculates the gradient at the current position and then updates the velocity. NAG is smarter; it first makes a temporary jump in the direction of the current velocity (a 'lookahead' point) and then calculates the gradient at that new point to make a more informed correction. This helps it slow down more effectively when approaching a minimum.
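The lookahead mechanic can be sketched against classical momentum. The toy objective f(w) = w², the hyperparameters, and the particular sign convention used for the lookahead point are illustrative assumptions; NAG formulations in the literature differ in where the momentum factor appears.

```python
# Nesterov-style step (sketch): the gradient is evaluated at a lookahead
# point, not at the current position as in classical momentum.
def nag_step(w, v, grad_fn, lr=0.1, beta=0.9):
    lookahead = w - lr * beta * v       # provisional jump along the current velocity
    v = beta * v + grad_fn(lookahead)   # gradient taken at the lookahead point
    w = w - lr * v
    return w, v

# On f(w) = w^2 (gradient 2w), the iterate is driven toward the minimum at 0.
w, v = 5.0, 0.0
for _ in range(300):
    w, v = nag_step(w, v, lambda x: 2.0 * x)
```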
34. You are observing the training loss curve for a model. The loss decreases overall, but it is highly erratic and noisy from one iteration to the next. Which optimization variant is most likely being used?
Gradient descent and variants (batch, stochastic, mini-batch)
Medium
A.L-BFGS
B.Newton's Method
C.Batch Gradient Descent (BGD)
D.Stochastic Gradient Descent (SGD)
Correct Answer: Stochastic Gradient Descent (SGD)
Explanation:
The high variance in the loss curve is a classic characteristic of SGD. Because each update is based on a single, randomly chosen data point, the gradient estimate is noisy, causing the loss to fluctuate significantly between iterations even as the overall trend is downward. BGD would produce a much smoother curve.
35. For a convex function f, Jensen's inequality states that for any set of points x₁, …, xₙ in its domain and any non-negative weights λ₁, …, λₙ that sum to 1, which of the following is true?
Convex sets and convex functions
Medium
A.f(Σᵢ λᵢxᵢ) ≥ Σᵢ λᵢf(xᵢ)
B.f(Σᵢ λᵢxᵢ) = Σᵢ λᵢf(xᵢ)
C.f(Σᵢ λᵢxᵢ) ≤ Σᵢ λᵢf(xᵢ)
D.f(Σᵢ λᵢxᵢ) < Σᵢ λᵢf(xᵢ) strictly
Correct Answer: f(Σᵢ λᵢxᵢ) ≤ Σᵢ λᵢf(xᵢ)
Explanation:
Jensen's inequality is a fundamental property of convex functions. It states that the function evaluated at a weighted average of points is less than or equal to the weighted average of the function's values at those points. Geometrically, this means the chord connecting two points on the graph of a convex function lies on or above the graph.
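Jensen's inequality can be verified numerically for a concrete convex function. The choice of f(x) = eˣ, the points, and the weights below are illustrative.

```python
import math

# Numeric check of Jensen's inequality for the convex function f(x) = exp(x):
# f(sum(l_i * x_i)) <= sum(l_i * f(x_i)) for non-negative weights summing to 1.
xs = [0.0, 1.0, 2.0]
weights = [0.2, 0.5, 0.3]           # non-negative, sum to 1

lhs = math.exp(sum(l * x for l, x in zip(weights, xs)))   # f of the weighted average
rhs = sum(l * math.exp(x) for l, x in zip(weights, xs))   # weighted average of f
# For a convex f the left side never exceeds the right side.
```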
36. In the Adam optimizer, what is the purpose of the bias correction step for the first and second moment estimates (m_t and v_t)?
RMSProp and Adam
Medium
A.To counteract the fact that the moment estimates are initialized at zero and are therefore biased towards zero, especially during initial steps.
B.To ensure the learning rate remains positive throughout training.
C.To normalize the gradients to prevent them from becoming too large (exploding gradients).
D.To add a momentum-like term to the update rule.
Correct Answer: To counteract the fact that the moment estimates are initialized at zero and are therefore biased towards zero, especially during initial steps.
Explanation:
The moment estimates (m_t and v_t) are exponentially moving averages initialized to zero. In the early stages of training, they are biased towards zero. The bias correction step divides them by (1 − β₁ᵗ) and (1 − β₂ᵗ) respectively, which scales them up to counteract this initial bias, leading to larger and more accurate updates early in training.
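The bias-corrected Adam step can be sketched as follows; the hyperparameters are the commonly used defaults, taken here as illustrative assumptions. Note how at t = 1 the corrections exactly undo the zero initialization of both moments.

```python
import math

# One Adam parameter update with bias correction (sketch).
def adam_step(w, m, v, g, t, lr=0.001, b1=0.9, b2=0.999, eps=1e-8):
    m = b1 * m + (1 - b1) * g        # first moment: momentum-like average of gradients
    v = b2 * v + (1 - b2) * g * g    # second moment: RMSProp-like average of squared gradients
    m_hat = m / (1 - b1 ** t)        # bias correction: m was initialized at zero
    v_hat = v / (1 - b2 ** t)        # bias correction: v was initialized at zero
    w = w - lr * m_hat / (math.sqrt(v_hat) + eps)
    return w, m, v

# First step (t = 1): m_hat recovers g exactly and v_hat recovers g^2 exactly,
# so the update magnitude is approximately lr regardless of the gradient scale.
w, m, v = adam_step(0.0, 0.0, 0.0, g=10.0, t=1)
```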
37. If the gradient of a non-convex, differentiable function f is zero at a point x₀, i.e., ∇f(x₀) = 0, what can be said about x₀?
Gradients and directional derivatives
Medium
A.x₀ must be a saddle point.
B.x₀ must be a local maximum.
C.x₀ must be a global minimum.
D.x₀ is a stationary point, which could be a local minimum, local maximum, or a saddle point.
Correct Answer: x₀ is a stationary point, which could be a local minimum, local maximum, or a saddle point.
Explanation:
A point where the gradient is zero is called a stationary point. For a general non-convex function, a stationary point can be a local minimum (a valley), a local maximum (a hill), or a saddle point. Without second-derivative information (the Hessian), we cannot distinguish between these possibilities.
38. When would a machine learning engineer most likely choose model parallelism over data parallelism for training a large neural network?
Optimization considerations in large-scale ML systems
Medium
A.When the dataset is extremely large but the model is relatively small.
B.When the model is too large to fit into the memory of a single GPU/accelerator.
C.When training on a single machine with multiple cores.
D.When they want to reduce the noise in gradient updates.
Correct Answer: When the model is too large to fit into the memory of a single GPU/accelerator.
Explanation:
Model parallelism involves splitting the model itself across multiple devices. This approach is necessary when a single model's parameters, activations, and gradients are so large that they exceed the available memory (e.g., GPU VRAM) of a single device. Data parallelism is used when the model fits on one device but the dataset is very large.
39. In momentum-based gradient descent, what is a potential downside of setting the momentum coefficient (e.g., β) very close to 1 (e.g., 0.999)?
Momentum-based optimization
Medium
A.The update rule will effectively become equivalent to standard SGD.
B.The optimizer may overshoot the minimum and struggle to stop, especially if the learning rate is not decreased.
C.The optimizer will converge much more slowly as it will heavily dampen all updates.
D.The memory required for training will increase quadratically.
Correct Answer: The optimizer may overshoot the minimum and struggle to stop, especially if the learning rate is not decreased.
Explanation:
A high momentum coefficient means the velocity vector places a very heavy emphasis on the previous direction. This can cause the optimizer to build up too much speed and 'overshoot' a minimum, oscillating back and forth across a valley instead of settling into it. It makes the optimizer less responsive to sudden changes in the loss landscape.
40. You are training a model on a dataset with highly redundant data (e.g., many nearly identical images). Which optimization variant would likely offer the most significant computational speedup over Batch Gradient Descent without a major loss in convergence quality?
Gradient descent and variants (batch, stochastic, mini-batch)
Medium
A.Conjugate Gradient.
B.L-BFGS.
C.Newton's Method.
D.Mini-batch or Stochastic Gradient Descent.
Correct Answer: Mini-batch or Stochastic Gradient Descent.
Explanation:
When data is highly redundant, the gradient computed from a small batch or even a single sample is a good approximation of the gradient of the entire dataset. Therefore, performing many cheap updates using Mini-batch GD or SGD will lead to much faster convergence in terms of wall-clock time compared to one expensive BGD update, which needlessly re-processes similar information.
41. Consider the Support Vector Machine (SVM) optimization problem with a soft margin. Which of the following statements about its dual formulation is most accurate?
Optimization problem formulation in ML
Hard
A.The dual problem involves Lagrange multipliers that are constrained to be negative and sum to one.
B.The dual problem is an unconstrained quadratic programming problem whose dimensionality depends on the number of features.
C.The dual problem's objective function is maximized, and its dimensionality is determined by the number of training samples, making it suitable for high-dimensional feature spaces.
D.The dual problem is computationally equivalent to the primal for all kernel types, offering no specific advantage.
Correct Answer: The dual problem's objective function is maximized, and its dimensionality is determined by the number of training samples, making it suitable for high-dimensional feature spaces.
Explanation:
The dual SVM problem maximizes an objective function with respect to the Lagrange multipliers αᵢ (one for each data point). Its dimensionality is n (the number of samples), not d (the number of features). This is advantageous when d ≫ n, especially with the kernel trick, as the computation depends on the dot products of feature vectors, not the vectors themselves. The Lagrange multipliers are constrained by 0 ≤ αᵢ ≤ C and Σᵢ αᵢyᵢ = 0.
42. Let f: ℝⁿ → ℝ be a function. The epigraph of f, denoted epi(f), is a subset of ℝⁿ⁺¹. What is the most precise relationship between the convexity of the function f and the convexity of its epigraph epi(f)?
Convex sets and convex functions
Hard
A.If epi(f) is a convex set, then f must be a convex function, but the converse is not always true.
B.The convexity of f is completely independent of the convexity of epi(f).
C.f is a convex function if and only if epi(f) is a convex set.
D.If f is a convex function, then epi(f) must be a convex set, but the converse is not always true.
Correct Answer: f is a convex function if and only if epi(f) is a convex set.
Explanation:
A function f is defined as convex if its epigraph, which is the set of points lying on or above its graph, i.e., epi(f) = {(x, t) : t ≥ f(x)}, is a convex set. This is a fundamental definition and the relationship is 'if and only if'.
43. Consider the function . What is the subgradient at the point ?
Gradients and directional derivatives
Hard
A.The set of all vectors such that where .
B.A unique vector .
C.The empty set, as the gradient does not exist.
D.The set of all vectors such that where .
Correct Answer: The set of all vectors such that where .
Explanation:
For a component of the function that is differentiable at the point, the corresponding entry of the subgradient is just the usual partial derivative. For a component that is non-differentiable at the point (such as an absolute value at 0), the subdifferential is the interval [−1, 1]. Therefore, the subgradient at the point is the set of vectors whose differentiable entries are fixed and whose non-differentiable entry can be any value in [−1, 1].
44. For a function f that is L-smooth but not necessarily convex, which statement about the convergence of Batch Gradient Descent (BGD) with a fixed learning rate is most accurate?
Gradient descent and variants (batch, stochastic, mini-batch)
Hard
A.BGD is guaranteed to converge to a global minimum if η ≤ 1/L.
B.BGD is guaranteed to converge to a local minimum for any η > 0.
C.BGD may diverge, oscillate, or converge to a stationary point, but it is not guaranteed to converge to even a local minimum.
D.BGD is guaranteed to converge to a stationary point (where ∇f(x) = 0) if η ≤ 1/L.
Correct Answer: BGD is guaranteed to converge to a stationary point (where ∇f(x) = 0) if η ≤ 1/L.
Explanation:
For an L-smooth function (meaning its gradient is Lipschitz continuous with constant L), Batch Gradient Descent with a fixed learning rate η ≤ 1/L is guaranteed to converge to a stationary point (a point where the gradient is zero). This point could be a local minimum, a local maximum, or a saddle point. Convergence to a global minimum is only guaranteed if the function is also convex.
Incorrect! Try again.
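The guarantee is easy to see numerically. A minimal sketch, assuming a made-up non-convex 1-D function; the starting point and step size are illustrative choices, with η small enough that η ≤ 1/L holds on the region the iterates visit:

```python
# Non-convex example f(x) = x^4 - 3x^2 + x, with gradient f'(x) = 4x^3 - 6x + 1.
# The gradient is Lipschitz on any bounded region; eta = 0.01 is well below 1/L there.
def grad(x):
    return 4 * x**3 - 6 * x + 1

x = 2.0          # starting point (made up)
eta = 0.01       # fixed learning rate, assumed <= 1/L on the iterates' region
for _ in range(5000):
    x -= eta * grad(x)

# BGD reaches *a* stationary point (gradient ~ 0) -- here a local minimum
# near x = 1.13 -- not necessarily the global minimum of f.
print(x, grad(x))
```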
45How does Nesterov Accelerated Gradient (NAG) differ fundamentally from classical momentum in its update rule?
Momentum-based optimization
Hard
A.NAG calculates the gradient and the momentum update completely independently and combines them with a weighted average.
B.NAG first takes a step in the direction of the accumulated momentum, computes the gradient at this "lookahead" position, and then makes a correction.
C.NAG uses a larger momentum term (β) than classical momentum, leading to faster acceleration.
D.NAG computes the gradient at the current position, then takes a large step in the direction of the accumulated momentum.
Correct Answer: NAG first takes a step in the direction of the accumulated momentum, computes the gradient at this "lookahead" position, and then makes a correction.
Explanation:
Classical momentum calculates the gradient at the current position θ_t and uses it to update the velocity vector. NAG introduces a 'lookahead' step: it first moves temporarily to θ_t + βv_t, calculates the gradient at this lookahead position, and then uses this gradient to compute the final update. This 'lookahead and correct' mechanism allows it to slow down more effectively when approaching a minimum.
Incorrect! Try again.
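The two update rules differ in a single line: where the gradient is evaluated. A minimal sketch on a toy quadratic (the quadratic, η, and β are illustrative assumptions, not from the question):

```python
# Toy objective f(w) = 0.5 * w**2, so grad(w) = w.
def grad(w):
    return w

eta, beta = 0.1, 0.9

# Classical momentum: gradient evaluated at the CURRENT point w.
w, v = 5.0, 0.0
for _ in range(200):
    v = beta * v - eta * grad(w)
    w = w + v

# Nesterov: gradient evaluated at the LOOKAHEAD point w + beta*v.
w_nag, v_nag = 5.0, 0.0
for _ in range(200):
    v_nag = beta * v_nag - eta * grad(w_nag + beta * v_nag)
    w_nag = w_nag + v_nag

print(abs(w), abs(w_nag))  # both near 0; NAG's lookahead damps the overshoot
```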
46In the Adam optimization algorithm, what is the primary purpose of the bias correction step for the first and second moment estimates (m_t and v_t)?
RMSProp and Adam
Hard
A.To initialize the moment estimates to non-zero values, avoiding division by zero.
B.To prevent the learning rate from becoming too large during the initial stages of training.
C.To counteract the tendency of the moment estimates to be biased towards zero, especially during the initial timesteps.
D.To normalize the gradients, ensuring they have a unit norm before being used in the update.
Correct Answer: To counteract the tendency of the moment estimates to be biased towards zero, especially during the initial timesteps.
Explanation:
The first and second moment estimates, m_t and v_t, are calculated as exponential moving averages. Since they are initialized to zero, they are biased towards zero, particularly during early training. The bias correction step, m̂_t = m_t / (1 − β₁ᵗ) and v̂_t = v_t / (1 − β₂ᵗ), scales them up to counteract this initialization bias, making the initial updates more accurate.
Incorrect! Try again.
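The effect of the correction can be seen with a constant gradient stream (the stream is made up; β₁ and β₂ follow the common defaults, and this sketch shows only the moment bookkeeping, not Adam's full parameter update):

```python
# Bias correction in Adam's moment estimates, with a constant gradient g = 1.
beta1, beta2 = 0.9, 0.999
m, v = 0.0, 0.0   # zero-initialized moments (the source of the bias)
g = 1.0

for t in range(1, 4):
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g * g
    m_hat = m / (1 - beta1**t)   # bias-corrected first moment
    v_hat = v / (1 - beta2**t)   # bias-corrected second moment
    print(t, round(m, 4), round(m_hat, 4))

# At t=1 the raw m is only 0.1 (pulled toward its zero init),
# but m_hat recovers the true mean gradient of 1.0 exactly.
```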
47In a large-scale distributed training system using asynchronous stochastic gradient descent (Async-SGD), what is the primary challenge that can impede convergence compared to its synchronous counterpart?
Optimization considerations in large-scale ML systems
Hard
A."Stale gradients," where a worker updates the central model using a gradient computed based on parameters that are already outdated.
B.Increased memory overhead on the parameter server to store multiple versions of the model.
C.Network latency causing some workers to become completely idle.
D.The requirement for a high-bandwidth connection between the parameter server and workers.
Correct Answer: "Stale gradients," where a worker updates the central model using a gradient computed based on parameters that are already outdated.
Explanation:
In Async-SGD, workers compute gradients and update a central parameter server without waiting for other workers. This means an update might be based on a 'stale' version of the model parameters, as other workers may have updated them in the meantime. This staleness introduces noise and variance into the optimization process, which can slow down or destabilize convergence.
Incorrect! Try again.
48Let f and g be two convex functions defined on ℝⁿ. Which of the following functions is NOT guaranteed to be convex?
Convex sets and convex functions
Hard
A.a·f(x) + b·g(x) for non-negative scalars a, b.
B.f(x)·g(x)
C.max(f(x), g(x))
D.h(f(x)), where h is a convex function that is also non-decreasing.
Correct Answer: f(x)·g(x)
Explanation:
The product of two convex functions is not generally convex. For example, let f(x) = (x − 1)² and g(x) = (x + 1)². Both are convex, but their product f(x)g(x) = (x² − 1)² is a non-convex function with multiple local minima. Non-negative weighted sums (A), pointwise maximum (C), and composition with a non-decreasing convex outer function (D) are all operations that preserve convexity.
Incorrect! Try again.
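The non-convexity of such a product can be checked numerically; a minimal sketch using the convex pair f(x) = (x − 1)² and g(x) = (x + 1)², whose product has two separate minima:

```python
# f(x) = (x-1)^2 and g(x) = (x+1)^2 are each convex,
# but their product p(x) = (x^2 - 1)^2 is not.
p = lambda x: (x**2 - 1) ** 2

# Convexity would require p(midpoint) <= average of the endpoint values;
# the chord from x = -1 to x = 1 violates this: p(0) = 1 > (p(-1)+p(1))/2 = 0.
lhs = p(0.0)
rhs = 0.5 * (p(-1.0) + p(1.0))
print(lhs, rhs)  # 1.0 0.0 -- the midpoint lies ABOVE the chord
```

The two minima at x = ±1 with a local maximum at x = 0 are exactly the "multiple local minima" a convex function cannot have.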
49In Stochastic Gradient Descent (SGD), how does the variance of the gradient estimate typically behave as the optimizer approaches a local minimum?
Gradient descent and variants (batch, stochastic, mini-batch)
Hard
A.The variance remains constant regardless of the proximity to the minimum.
B.The variance increases exponentially as the optimizer gets closer to the minimum.
C.The variance decreases to zero as the true gradient approaches zero.
D.The variance does not approach zero, which prevents SGD with a fixed learning rate from converging to the exact minimum.
Correct Answer: The variance does not approach zero, which prevents SGD with a fixed learning rate from converging to the exact minimum.
Explanation:
The variance of the stochastic gradient does not necessarily go to zero even when the full batch gradient approaches zero at a minimum. This persistent variance causes SGD with a fixed learning rate to 'bounce around' the minimum rather than converging to the exact point. To achieve convergence to the minimum, the learning rate must be annealed (decreased) over time.
Incorrect! Try again.
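A small simulation illustrates the point. The additive noise below models per-sample gradient variance that persists at the minimum; the objective, noise level, and step size are made-up illustrative values:

```python
import random

# Minimize f(w) = 0.5*(w - 1)^2 with noisy gradient estimates
# g = (w - 1) + noise, where the noise does NOT vanish at the minimum.
random.seed(0)
w, eta = 5.0, 0.1
tail = []
for t in range(2000):
    g = (w - 1.0) + random.gauss(0.0, 1.0)  # stochastic gradient
    w -= eta * g
    if t >= 1000:
        tail.append(w)   # record the late iterates

mean_w = sum(tail) / len(tail)
spread = max(tail) - min(tail)
print(round(mean_w, 2), round(spread, 2))
# w hovers around the minimum at 1.0 but keeps fluctuating:
# with a fixed eta the iterates never settle to the exact point.
```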
50Despite its general effectiveness, Adam has been observed to fail to converge on certain simple, convex optimization problems where standard SGD with momentum converges. What is a key theoretical reason for this phenomenon?
RMSProp and Adam
Hard
A.The algorithm can be overly sensitive to the choice of the epsilon (ε) parameter.
B.The adaptive learning rate can become excessively small in directions with historically large but currently informative gradients.
C.The second moment estimate v_t can cause the effective learning rate to be dominated by past, irrelevant gradient information, leading to convergence to a suboptimal point.
D.The use of bias correction causes instability in later stages of training.
Correct Answer: The second moment estimate v_t can cause the effective learning rate to be dominated by past, irrelevant gradient information, leading to convergence to a suboptimal point.
Explanation:
Research has shown that Adam's long-term memory of squared gradients (v_t) can be problematic. If large gradients are seen early in training, v_t becomes large, shrinking the effective learning rate. Even if later gradients are small but consistently point to the optimum, the optimizer may fail to make progress because the historical information in v_t keeps the learning rate too small.
Incorrect! Try again.
51When formulating a linear regression problem, we can add an L1 (Lasso) or L2 (Ridge) regularization term. From an optimization landscape perspective, what is the most significant difference between the objective functions created by these two regularizers?
Optimization problem formulation in ML
Hard
A.The L2-regularized objective is smooth and strictly convex, while the L1-regularized objective is non-smooth but still strictly convex.
B.The L1-regularized objective function is non-differentiable at points where any coefficient is zero, leading to sparse solutions, while the L2 objective is smooth everywhere.
C.The L2-regularized objective has a unique global minimum, whereas the L1-regularized objective can have multiple global minima.
D.The L1 regularizer penalizes large coefficients more heavily than the L2 regularizer, preventing any single weight from dominating.
Correct Answer: The L1-regularized objective function is non-differentiable at points where any coefficient is zero, leading to sparse solutions, while the L2 objective is smooth everywhere.
Explanation:
The L1 norm, ||w||₁ = Σᵢ|wᵢ|, creates 'kinks' in the loss surface wherever any weight wᵢ = 0, making it non-differentiable there. Optimizers are often driven into these kinks, which results in sparse solutions (weights that are exactly zero). The L2 norm, ||w||₂² = Σᵢwᵢ², is a smooth quadratic function, resulting in a smooth loss surface that encourages small, but typically non-zero, weights. This difference in differentiability is the key.
Incorrect! Try again.
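One concrete way to see the difference is through the corresponding shrinkage steps used by proximal methods (a sketch; the helper names and the weight vector are illustrative, not from the question):

```python
import numpy as np

# Soft-thresholding is the proximal step for the L1 penalty: it sets small
# coordinates EXACTLY to zero. The comparable L2 (ridge) step only scales
# coordinates down, never zeroing them.
def prox_l1(w, lam):
    return np.sign(w) * np.maximum(np.abs(w) - lam, 0.0)

def shrink_l2(w, lam):
    return w / (1.0 + lam)

w = np.array([3.0, 0.4, -0.2, -2.5])
sparse = prox_l1(w, 0.5)
dense = shrink_l2(w, 0.5)
print(sparse)  # the two small entries become exactly 0
print(dense)   # every entry shrinks but stays non-zero
```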
52Consider the function f(x, y) = max(x, y). What is the directional derivative of f at the point p = (1, 1) in the direction u = (-1, -1)?
Gradients and directional derivatives
Hard
A.0
B.The directional derivative does not exist.
C.-4
D.-1
Correct Answer: -1
Explanation:
The point p = (1, 1) lies on the non-differentiable seam where x = y. We use the definition: D_u f(p) = lim_{h→0⁺} (f(p + hu) − f(p)) / h. Here, p = (1, 1) and u = (−1, −1). The function value is f(p) = max(1, 1) = 1. For small h > 0, f(p + hu) = max(1 − h, 1 − h) = 1 − h. Thus, the difference quotient is ((1 − h) − 1)/h = −1. The limit is therefore −1.
Incorrect! Try again.
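Limits of this kind can be verified numerically straight from the definition; a minimal sketch, assuming the seam example f(x, y) = max(x, y) at p = (1, 1) with direction u = (−1, −1):

```python
# Numeric check of D_u f(p) = lim_{h->0+} (f(p + h*u) - f(p)) / h.
f = lambda x, y: max(x, y)
p, u = (1.0, 1.0), (-1.0, -1.0)

# Powers of two keep the floating-point arithmetic exact here.
for h in (0.5, 0.25, 0.125, 0.0625):
    quotient = (f(p[0] + h * u[0], p[1] + h * u[1]) - f(*p)) / h
    print(h, quotient)  # the one-sided quotient equals -1 for every h
```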
53You are training a neural network and observe that the training loss curve shows large, sustained oscillations that slowly decrease in amplitude. Which hyperparameter adjustment is most likely to mitigate this specific behavior?
Momentum-based optimization
Hard
A.Decreasing the learning rate (η) and increasing the momentum coefficient (β) simultaneously.
B.Decreasing the momentum coefficient (β) to reduce the overshoot causing the oscillations.
C.Increasing the batch size to get a more accurate gradient estimate.
D.Increasing the learning rate (η) to escape the oscillatory pattern.
Correct Answer: Decreasing the momentum coefficient (β) to reduce the overshoot causing the oscillations.
Explanation:
Large, sustained oscillations are a classic sign of the momentum term being too high. The optimizer overshoots the minimum in a valley of the loss landscape, and the accumulated momentum carries it too far up the other side. Reducing the momentum coefficient (e.g., from 0.99 to 0.9) dampens this behavior by placing more weight on the current gradient and less on the accumulated history, allowing the optimizer to settle.
Incorrect! Try again.
54Let f be a convex function and X be a random variable. According to Jensen's inequality, what is the relationship between E[f(X)] and f(E[X])? If f is strictly convex and X is not a constant, what can be further concluded?
Convex sets and convex functions
Hard
A.E[f(X)] ≥ f(E[X]), with strict inequality whenever f is strictly convex.
B.E[f(X)] ≤ f(E[X]), with inequality only if the function f is non-linear.
C.E[f(X)] ≥ f(E[X]), with equality holding if and only if f is a linear function.
D.E[f(X)] ≥ f(E[X]), with strict inequality if f is strictly convex and X is not a constant.
Correct Answer: E[f(X)] ≥ f(E[X]), with strict inequality if f is strictly convex and X is not a constant.
Explanation:
Jensen's inequality states that for a convex function f, E[f(X)] ≥ f(E[X]). This means the expectation of the function is greater than or equal to the function of the expectation. If the function is strictly convex and the random variable has non-zero variance (is not a constant), then the inequality becomes strict: E[f(X)] > f(E[X]).
Incorrect! Try again.
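A quick Monte Carlo check, assuming the strictly convex f(x) = x² and X uniform on (0, 1) (illustrative choices):

```python
import random

# For f(x) = x^2 and non-constant X ~ Uniform(0, 1):
# E[f(X)] = 1/3 strictly exceeds f(E[X]) = (1/2)^2 = 1/4,
# and the gap equals Var(X) = 1/12.
random.seed(0)
xs = [random.random() for _ in range(100_000)]

e_f_x = sum(x * x for x in xs) / len(xs)  # estimates E[X^2] = 1/3
f_e_x = (sum(xs) / len(xs)) ** 2          # estimates (E[X])^2 = 1/4
print(round(e_f_x, 3), round(f_e_x, 3))
```

For f(x) = x² the gap E[f(X)] − f(E[X]) is exactly Var(X), which is why the inequality is strict whenever X is not constant.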
55Consider an objective function with a long, narrow, "ravine-like" valley. The gradient is large perpendicular to the ravine and small along it. How would Batch (BGD), Stochastic (SGD), and Mini-batch SGD likely perform?
Gradient descent and variants (batch, stochastic, mini-batch)
Hard
A.Mini-batch SGD would be the most inefficient due to the combined problems of zigzagging and gradient noise.
B.All three methods would perform identically, as the underlying gradient direction is the same on average.
C.BGD would zigzag inefficiently across the ravine; SGD's noise could help it move faster along the ravine; Mini-batch would offer a balance.
D.SGD would be the slowest due to its high variance, while BGD would move directly down the ravine.
Correct Answer: BGD would zigzag inefficiently across the ravine; SGD's noise could help it move faster along the ravine; Mini-batch would offer a balance.
Explanation:
BGD uses the true gradient, which points steeply across the ravine, causing it to zigzag inefficiently. SGD's high-variance updates, while also causing zigzagging, can have a random component along the ravine's axis, sometimes leading to faster progress in this specific pathology. Mini-batch SGD reduces the variance compared to SGD (less zigzagging) while being more efficient than BGD, offering a good compromise.
Incorrect! Try again.
56RMSProp was developed to address a major drawback of the AdaGrad algorithm. What is this critical flaw in AdaGrad that RMSProp rectifies?
RMSProp and Adam
Hard
A.AdaGrad uses a momentum term that can cause it to overshoot minima.
B.AdaGrad requires manual tuning of a global learning rate, which RMSProp automates completely.
C.AdaGrad's learning rate can sometimes increase, leading to instability, which RMSProp prevents.
D.AdaGrad's learning rate aggressively and monotonically decreases, often becoming infinitesimally small and prematurely stopping learning.
Correct Answer: AdaGrad's learning rate aggressively and monotonically decreases, often becoming infinitesimally small and prematurely stopping learning.
Explanation:
AdaGrad scales the learning rate by the square root of the sum of all past squared gradients. This sum grows continuously, causing the learning rate to monotonically decrease. For long training runs, the rate can become so small that learning stops. RMSProp fixes this by using an exponentially decaying moving average of squared gradients instead, allowing the optimizer to 'forget' the distant past and prevent the learning rate from vanishing.
Incorrect! Try again.
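The contrast between the two accumulators can be sketched in a few lines (the decay rate ρ and the constant unit-gradient stream are illustrative assumptions):

```python
# AdaGrad's accumulator grows without bound over a long run of gradients;
# RMSProp's exponential moving average saturates instead.
eps, rho = 1e-8, 0.9
g = 1.0  # pretend every step sees the same unit gradient

adagrad_acc, rms_acc = 0.0, 0.0
for t in range(1, 10001):
    adagrad_acc += g * g                          # sum of ALL past squared grads
    rms_acc = rho * rms_acc + (1 - rho) * g * g   # decaying average

adagrad_scale = 1.0 / (adagrad_acc ** 0.5 + eps)  # effective step multiplier
rms_scale = 1.0 / (rms_acc ** 0.5 + eps)

print(adagrad_scale, rms_scale)
# AdaGrad's multiplier has shrunk to 1/sqrt(10000) = 0.01 and keeps shrinking;
# RMSProp's stays near 1 because its average converged to g^2 = 1.
```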
57In a data-parallel, synchronous distributed training setup using a parameter server, which strategy is most directly aimed at alleviating the communication bottleneck caused by synchronizing gradients?
Optimization considerations in large-scale ML systems
Hard
A.Increasing the number of worker nodes to parallelize the computation more effectively.
B.Using a more complex model with more parameters to increase computational load relative to communication.
C.Gradient quantization, where gradients are converted to lower-precision representations (e.g., 8-bit integers) before being sent over the network.
D.Decreasing the learning rate to ensure that smaller, less frequent updates are sufficient.
Correct Answer: Gradient quantization, where gradients are converted to lower-precision representations (e.g., 8-bit integers) before being sent over the network.
Explanation:
In synchronous distributed training, the time taken to transmit gradients over the network can become a bottleneck. Gradient quantization directly tackles this by reducing the size of the data being communicated. By converting 32-bit floating-point gradients to a lower-precision format (like 8-bit or 16-bit), the total data volume is drastically reduced, thus alleviating the communication bottleneck and speeding up the synchronization step.
Incorrect! Try again.
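A minimal sketch of the idea, assuming a naive per-tensor uniform int8 scheme (production systems typically add stochastic rounding, error feedback, and other refinements):

```python
import numpy as np

# Quantize a float32 gradient tensor to int8 plus one float scale factor.
def quantize_int8(grad):
    scale = np.abs(grad).max() / 127.0         # one float sent alongside
    q = np.round(grad / scale).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

g = np.array([0.50, -0.25, 0.10, -0.01], dtype=np.float32)
q, s = quantize_int8(g)
g_hat = dequantize(q, s)
print(q.nbytes, "bytes instead of", g.nbytes)  # 4 bytes instead of 16
print(float(np.abs(g - g_hat).max()))          # worst-case rounding error ~ s/2
```

The receiver reconstructs an approximate gradient from the int8 payload and the scale, trading a small rounding error for a 4x reduction in network traffic.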
58For a differentiable function f: ℝⁿ → ℝ, what is the geometric relationship between the gradient vector ∇f(x₀) at a point x₀ and the level set {x : f(x) = f(x₀)} that passes through x₀?
Gradients and directional derivatives
Hard
A.The gradient vector is orthogonal (normal) to the tangent plane of the level set at x₀.
B.The gradient vector is parallel to the tangent plane of the level set at x₀.
C.The gradient vector points towards the direction of minimum curvature on the level set at x₀.
D.There is no consistent geometric relationship between the gradient and the level set.
Correct Answer: The gradient vector is orthogonal (normal) to the tangent plane of the level set at x₀.
Explanation:
The gradient vector ∇f points in the direction of steepest ascent. A level set is a surface where the function's value is constant. To move along the level set (in its tangent plane) means moving in a direction where the function value does not change, so the directional derivative in any such tangent direction u must be zero. Since D_u f = ∇f · u = 0, the gradient must be orthogonal to all tangent vectors, making it normal to the level set itself.
Incorrect! Try again.
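A quick check on a function whose level sets are easy to describe; f(x, y) = x² + y² and the sample point are illustrative choices:

```python
# For f(x, y) = x^2 + y^2, level sets are circles centered at the origin.
# At any point, the circle's tangent direction (-y, x) should be
# orthogonal to the gradient (2x, 2y).
def gradient(x, y):
    return (2 * x, 2 * y)

x0, y0 = 3.0, 4.0            # a point on the level set f = 25
gx, gy = gradient(x0, y0)
tx, ty = -y0, x0             # tangent to the circle at (x0, y0)
dot = gx * tx + gy * ty
print(dot)  # 0.0 -- the gradient is normal to the level set
```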
59Consider an optimizer at a saddle point where the true gradient is zero, but there are directions of positive and negative curvature. How would SGD with Momentum behave differently from standard SGD?
Momentum-based optimization
Hard
A.Standard SGD would be stuck, while Momentum's accumulated velocity from previous steps would likely carry it through the saddle point.
B.Both algorithms would get stuck permanently, as the gradient is zero.
C.Both algorithms would escape, but Momentum would oscillate wildly around the saddle point before escaping.
D.Momentum would get stuck due to its tendency to dampen movement, while standard SGD's noise might allow it to escape.
Correct Answer: Standard SGD would be stuck, while Momentum's accumulated velocity from previous steps would likely carry it through the saddle point.
Explanation:
At a saddle point, the gradient is zero. Standard SGD would halt. However, SGD with Momentum maintains a velocity vector v_t = βv_{t−1} − η∇f(θ_t). Even if the current gradient is zero, the term βv_{t−1} from previous steps will be non-zero. This stored velocity allows the optimizer to 'coast' through the flat region of the saddle point and continue descending, making it much better at escaping saddle points.
Incorrect! Try again.
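A toy illustration on the classic saddle f(x, y) = x² − y² (the starting velocity and hyperparameters are made-up values):

```python
# Saddle f(x, y) = x^2 - y^2 with gradient (2x, -2y).
# Starting exactly at the saddle (0, 0): a plain (noiseless) gradient step
# moves nowhere, but a leftover velocity carries momentum through.
eta, beta = 0.1, 0.9

def grad(x, y):
    return 2 * x, -2 * y

# Plain gradient step from the saddle: the gradient is (0, 0), so no movement.
x, y = 0.0, 0.0
gx, gy = grad(x, y)
x, y = x - eta * gx, y - eta * gy
stuck = (x, y) == (0.0, 0.0)

# Momentum with a small accumulated velocity along the descent direction y.
x, y, vx, vy = 0.0, 0.0, 0.0, 0.01
for _ in range(50):
    gx, gy = grad(x, y)
    vx = beta * vx - eta * gx
    vy = beta * vy - eta * gy
    x, y = x + vx, y + vy

print(stuck, y)  # True, and |y| has grown well past the saddle
```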
60For a function that is L-smooth and μ-strongly convex, Batch Gradient Descent (BGD) achieves a linear convergence rate. What is the theoretical convergence rate for Stochastic Gradient Descent (SGD) with a decaying learning rate of η_t = O(1/t)?
Gradient descent and variants (batch, stochastic, mini-batch)
Hard
A.Sublinear convergence, O(1/t).
B.Sublinear convergence, O(1/√t).
C.Logarithmic convergence, O(1/log t).
D.Linear convergence, O(ρᵗ), where ρ ∈ (0, 1).
Correct Answer: Sublinear convergence, O(1/t).
Explanation:
While BGD enjoys a fast linear (or geometric) convergence rate on strongly convex problems, SGD's convergence is fundamentally limited by the variance of its gradient estimates. Even with a carefully chosen diminishing learning rate schedule like η_t = O(1/t), the convergence rate for SGD in terms of the objective function value is sublinear, specifically O(1/t). This is significantly slower than BGD's linear rate, as the inherent noise prevents rapid, direct convergence.