1. What is the primary goal of an optimization algorithm in the context of training a machine learning model?
Optimization problem formulation in ML
Easy
A.To make the model as complex as possible
B.To minimize the loss function
C.To maximize the number of features
D.To increase the size of the dataset
Correct Answer: To minimize the loss function
Explanation:
In machine learning, optimization is the process of finding the model parameters that minimize the loss function, which measures the error between the model's predictions and the actual true values.
2. In machine learning, a 'loss function' or 'cost function' quantifies:
Optimization problem formulation in ML
Easy
A.The model's overall accuracy on the test set
B.The computational resources used for training
C.The number of training iterations (epochs)
D.The error or 'cost' of the model's predictions
Correct Answer: The error or 'cost' of the model's predictions
Explanation:
A loss function is a mathematical function that measures how far a model's prediction is from the target value. The goal of training is to adjust the model's parameters to make this value as small as possible.
3. What is a key property of a convex function that makes it desirable for optimization?
Convex sets and convex functions
Easy
A.It cannot be minimized
B.Any local minimum is also a global minimum
C.It has multiple local minima
D.It is always a linear function
Correct Answer: Any local minimum is also a global minimum
Explanation:
For a convex function, any point that is a local minimum is guaranteed to be the single global minimum. This simplifies optimization, as an algorithm won't get stuck in a suboptimal local minimum.
4. A set of points is defined as a 'convex set' if:
Convex sets and convex functions
Easy
A.It has a finite number of points
B.It is shaped like a perfect circle or sphere
C.It must contain the origin (0,0)
D.A line segment connecting any two points within the set lies entirely within the set
Correct Answer: A line segment connecting any two points within the set lies entirely within the set
Explanation:
This is the fundamental definition of a convex set. If you can pick any two points in the set and the straight line between them stays inside the set, the set is convex.
5. The gradient of a function f at a specific point, denoted ∇f, points in the direction of:
Gradients and directional derivatives
Easy
A.The steepest descent
B.The function's origin
C.Zero change (a flat area)
D.The steepest ascent
Correct Answer: The steepest ascent
Explanation:
The gradient is a vector that points in the direction where the function's value increases most rapidly. Its magnitude indicates the rate of this increase.
6. In gradient-based optimization, why do we move in the direction of the negative gradient?
Gradients and directional derivatives
Easy
A.To move towards a maximum of the function
B.Because the gradient is always a negative value
C.To move towards a minimum of the function
D.To increase the learning rate
Correct Answer: To move towards a minimum of the function
Explanation:
Since the gradient points in the direction of steepest increase, the negative gradient (−∇f) points in the direction of steepest decrease. Following this direction helps us find a minimum of the loss function.
7. If the gradient of a loss function at a certain point is zero, what does this indicate?
Gradients and directional derivatives
Easy
A.The model has perfectly fit the data
B.The algorithm has encountered an error
C.The point is a critical point (minimum, maximum, or saddle point)
D.The learning rate is too high
Correct Answer: The point is a critical point (minimum, maximum, or saddle point)
Explanation:
A zero gradient signifies a 'flat' spot in the loss landscape. This occurs at local minima, maxima, and saddle points, where a small step in any direction does not change the function's value.
8. What is the role of the 'learning rate' in the Gradient Descent algorithm?
Gradient descent and variants (batch, stochastic, mini-batch)
Easy
A.It determines the step size taken during each iteration
B.It measures the accuracy of the model
C.It defines the size of the mini-batch
D.It specifies the total number of iterations
Correct Answer: It determines the step size taken during each iteration
Explanation:
The learning rate is a hyperparameter that scales the gradient. It controls how large of a step the algorithm takes in the direction of the negative gradient to update the model's weights.
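The update rule described in the explanation above can be sketched in a few lines of Python. This is a toy example: the function f(w) = (w − 3)², the starting point, and the learning rate lr = 0.1 are illustrative choices, not recommendations.

```python
# Toy gradient descent on f(w) = (w - 3)^2.
def grad(w):
    return 2.0 * (w - 3.0)   # derivative of (w - 3)^2

w = 0.0
lr = 0.1                     # learning rate: scales the step along the negative gradient
for _ in range(100):
    w -= lr * grad(w)        # move against the gradient

# w ends up very close to the minimizer w* = 3
```

A larger lr takes bigger steps per iteration but risks overshooting the minimum; a smaller lr is safer but slower.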
9. Which variant of Gradient Descent calculates the gradient using the entire training dataset for a single parameter update?
Gradient descent and variants (batch, stochastic, mini-batch)
Easy
A.Stochastic Gradient Descent (SGD)
B.Adam
C.Batch Gradient Descent
D.Mini-batch Gradient Descent
Correct Answer: Batch Gradient Descent
Explanation:
Batch Gradient Descent processes all training examples for each single update. While it provides a very accurate gradient, it can be extremely slow and computationally expensive for large datasets.
10. Stochastic Gradient Descent (SGD) updates the model's parameters using:
Gradient descent and variants (batch, stochastic, mini-batch)
Easy
A.Only the validation dataset
B.The entire training dataset
C.A small batch of training examples
D.A single training example at a time
Correct Answer: A single training example at a time
Explanation:
Stochastic Gradient Descent performs one update for each training example. This makes each update much faster but also noisier, leading to a less stable convergence path.
11. What is the primary advantage of Mini-batch Gradient Descent over Batch Gradient Descent?
Gradient descent and variants (batch, stochastic, mini-batch)
Easy
A.It does not require setting a learning rate
B.It is computationally more efficient and faster for large datasets
C.It always finds a better minimum
D.It guarantees convergence in fewer iterations
Correct Answer: It is computationally more efficient and faster for large datasets
Explanation:
By using a small batch of data instead of the entire dataset, mini-batch gradient descent provides a good balance between the accuracy of the batch gradient and the speed of the stochastic gradient.
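The batch/update structure described above can be sketched as follows. This is a toy example (estimating the mean of a small dataset by minimizing squared error); the batch size, learning rate, and epoch count are illustrative choices.

```python
import random

# Toy mini-batch gradient descent: estimate the mean of `data` by
# minimizing L(w) = mean((w - x)^2) over each mini-batch.
data = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
w, lr, batch_size = 0.0, 0.1, 4

random.seed(0)
for epoch in range(200):
    random.shuffle(data)                       # new batch composition each epoch
    for i in range(0, len(data), batch_size):
        batch = data[i:i + batch_size]
        # batch-averaged gradient of (w - x)^2
        g = sum(2.0 * (w - x) for x in batch) / len(batch)
        w -= lr * g                            # one update per mini-batch

# w hovers near the full-data mean, 4.5, with small batch-to-batch noise
```

Each epoch performs len(data)/batch_size updates instead of one (as in batch GD) or len(data) (as in SGD), which is where the compromise comes from.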
12. What is the core idea behind the Momentum optimization algorithm?
Momentum-based optimization
Easy
A.To accelerate movement in the relevant direction and dampen oscillations
B.To use a different learning rate for every parameter
C.To only use data points that have high error
D.To randomly change direction to escape local minima
Correct Answer: To accelerate movement in the relevant direction and dampen oscillations
Explanation:
Momentum adds a fraction of the previous update vector to the current one. This helps the optimizer build up speed in directions with a consistent gradient and reduces oscillations in other directions, much like a ball rolling down a hill.
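The "fraction of the previous update vector" idea can be sketched as a two-line update rule. The hyperparameters lr = 0.1 and beta = 0.9 and the toy objective f(w) = w² are illustrative assumptions.

```python
# Classical momentum update (sketch).
def momentum_step(w, v, grad, lr=0.1, beta=0.9):
    v = beta * v + grad   # accumulate velocity from past gradients
    w = w - lr * v        # step using the velocity, not the raw gradient
    return w, v

# On f(w) = w^2 (gradient 2w), momentum carries the iterate toward the minimum at 0.
w, v = 5.0, 0.0
for _ in range(300):
    w, v = momentum_step(w, v, 2.0 * w)
```

With beta = 0, this reduces to plain gradient descent; larger beta gives more weight to the accumulated direction.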
13. How does momentum help when the loss surface has long, narrow ravines (valleys)?
Momentum-based optimization
Easy
A.It increases the learning rate to flatten the ravine
B.It forces the updates to jump out of the ravine
C.It stops the optimization process
D.It helps accelerate progress along the bottom of the ravine
Correct Answer: It helps accelerate progress along the bottom of the ravine
Explanation:
In narrow ravines, the gradient often oscillates across the steep sides. Momentum dampens these side-to-side oscillations while accumulating the gradient along the length of the ravine, leading to faster convergence.
14. The Adam optimizer is an adaptive learning rate method that combines the key ideas of which two other optimizers?
RMSProp and Adam
Easy
A.Stochastic Gradient Descent and Batch Gradient Descent
B.L-BFGS and Momentum
C.Momentum and RMSProp
D.Adagrad and Newton's Method
Correct Answer: Momentum and RMSProp
Explanation:
Adam (Adaptive Moment Estimation) combines the 'momentum' concept of using a moving average of the gradient with the adaptive learning rate feature of RMSProp, which uses a moving average of the squared gradients.
15. What does an 'adaptive learning rate' mean in the context of optimizers like RMSProp and Adam?
RMSProp and Adam
Easy
A.The optimizer maintains a separate learning rate for each model parameter
B.The learning rate is randomly chosen at each step
C.The user must manually adapt the learning rate during training
D.The learning rate steadily increases over time
Correct Answer: The optimizer maintains a separate learning rate for each model parameter
Explanation:
Adaptive learning rate methods adjust the learning rate for each parameter individually. Parameters that receive large gradients will have their effective learning rate reduced, while parameters with small gradients will have theirs increased.
16. What problem seen in other algorithms like Adagrad does RMSProp help to solve?
RMSProp and Adam
Easy
A.The updates being too noisy
B.The use of too much memory
C.The learning rate becoming aggressively small and nearly stopping learning
D.The learning rate becoming too large and causing divergence
Correct Answer: The learning rate becoming aggressively small and nearly stopping learning
Explanation:
In Adagrad, the learning rate can become infinitely small over time because it accumulates all past squared gradients. RMSProp fixes this by using an exponentially decaying average, preventing the learning rate from vanishing too quickly.
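The difference between the two accumulators can be seen directly in a sketch. The decay rate 0.9 and the constant squared gradient g2 = 1.0 are illustrative assumptions.

```python
# Contrast the denominators of Adagrad and RMSProp over 1000 steps,
# assuming a constant squared gradient g2 = 1.0 at every step.
g2 = 1.0
adagrad_acc, rmsprop_acc = 0.0, 0.0
for t in range(1000):
    adagrad_acc += g2                            # sums all history: grows without bound
    rmsprop_acc = 0.9 * rmsprop_acc + 0.1 * g2   # decaying average: stays bounded

# Adagrad's accumulator reaches 1000 here, so its effective step
# lr / sqrt(acc) keeps shrinking; RMSProp's settles near the recent mean, ~1.0.
```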
17. When training a deep learning model on a massive dataset, which gradient descent variant is most often preferred for practical reasons?
Optimization considerations in large-scale ML systems
Easy
A.Mini-batch Gradient Descent
B.Grid Search
C.Batch Gradient Descent
D.Newton's Method
Correct Answer: Mini-batch Gradient Descent
Explanation:
Batch Gradient Descent is computationally infeasible for very large datasets as it requires loading all data for one update. Mini-batch GD provides a compromise, offering efficient computation and more stable convergence than pure SGD.
18. A common hardware-related challenge in large-scale machine learning is:
Optimization considerations in large-scale ML systems
Easy
A.The computer case not having enough fans
B.Running out of hard drive space to store Python scripts
C.The keyboard wearing out from too much coding
D.Fitting the model and a batch of data into GPU memory (VRAM)
Correct Answer: Fitting the model and a batch of data into GPU memory (VRAM)
Explanation:
Modern deep learning models can have billions of parameters, and training data can be large. A major constraint is the limited amount of high-speed memory (VRAM) on GPUs, which must hold the model, the data batch, and the gradients.
19. Why are convex optimization problems generally easier to solve than non-convex ones?
Convex sets and convex functions
Easy
A.They require less data to solve
B.They can only be used for linear models
C.They always converge in a single step
D.They do not have local minima that could trap the optimization algorithm
Correct Answer: They do not have local minima that could trap the optimization algorithm
Explanation:
Non-convex problems can have many local minima, and an algorithm like gradient descent might get stuck in one that is not the best overall solution (the global minimum). Convex problems have only one minimum, which is the global minimum.
20. The path of Stochastic Gradient Descent (SGD) towards the minimum is often described as 'noisy' or 'zig-zagging'. Why is this?
Gradient descent and variants (batch, stochastic, mini-batch)
Easy
A.Because the gradient is estimated based on only one training sample at a time
B.Because the algorithm adds random noise on purpose
C.Because the learning rate is constantly increasing
D.Because it uses the entire dataset for each step
Correct Answer: Because the gradient is estimated based on only one training sample at a time
Explanation:
The gradient from a single sample can be a poor estimate of the true gradient over the whole dataset. This causes the updates to fluctuate and follow a noisy path, though the general direction is still towards the minimum.
21. In a logistic regression model, the goal is to find parameters that maximize the likelihood of the training data. How is this typically formulated as a minimization problem for optimization algorithms like gradient descent?
Optimization problem formulation in ML
Medium
A.By minimizing the likelihood directly
B.By maximizing the L2 norm of the parameters
C.By minimizing the negative log-likelihood
D.By minimizing the number of misclassified points
Correct Answer: By minimizing the negative log-likelihood
Explanation:
Optimization algorithms are usually designed to minimize a function. Maximizing the likelihood is equivalent to maximizing its logarithm (since log is a monotonic function), which in turn is equivalent to minimizing the negative log-likelihood, −log L(θ). This is the standard loss function for logistic regression.
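The negative log-likelihood loss from the explanation above can be sketched for binary labels. The helper name and example probabilities are illustrative.

```python
import math

# Negative log-likelihood for binary classification: for labels y in {0, 1}
# and predicted probabilities p, NLL = -sum(log p_i if y_i == 1 else log(1 - p_i)).
# Minimizing this is equivalent to maximizing the likelihood.
def negative_log_likelihood(probs, labels):
    return -sum(math.log(p) if y == 1 else math.log(1.0 - p)
                for p, y in zip(probs, labels))

# Confident correct predictions give a small loss; confident wrong ones a large loss.
good = negative_log_likelihood([0.9, 0.1], [1, 0])  # both predictions correct
bad = negative_log_likelihood([0.1, 0.9], [1, 0])   # both predictions wrong
```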
22. If a function f is convex and differentiable, and we find a point x* where the gradient ∇f(x*) = 0, what can we conclude about x*?
Convex sets and convex functions
Medium
A.x* is a global minimum.
B.x* could be a local maximum.
C.x* is a saddle point.
D.x* is a local minimum, but not necessarily global.
Correct Answer: x* is a global minimum.
Explanation:
For a convex function, any point where the gradient is zero is a global minimum. This is a key property that makes optimizing convex functions much more reliable than non-convex ones, as there are no local minima that aren't also global.
23. What is the primary trade-off when switching from Batch Gradient Descent (BGD) to Stochastic Gradient Descent (SGD) for a large dataset?
Gradient descent and variants (batch, stochastic, mini-batch)
Medium
A.Trading faster, more frequent updates for a noisier, less direct convergence path.
B.Trading the need for a learning rate for automatic step size adjustment.
C.Trading a smooth convergence path for a higher computational cost per epoch.
D.Trading lower memory usage for a guaranteed convergence to the global minimum.
Correct Answer: Trading faster, more frequent updates for a noisier, less direct convergence path.
Explanation:
SGD updates the model parameters using only one data point at a time, making each update much faster but also much noisier than BGD, which uses the entire dataset. This noise can help escape shallow local minima but results in a more erratic convergence path.
24. In the context of Gradient Descent, what is the main role of the momentum term?
Momentum-based optimization
Medium
A.It guarantees that the algorithm will find the global minimum in non-convex functions.
B.It helps accelerate convergence in relevant directions and dampens oscillations in others.
C.It adapts the learning rate for each parameter individually based on past gradients.
D.It normalizes the gradient vector to have a unit length.
Correct Answer: It helps accelerate convergence in relevant directions and dampens oscillations in others.
Explanation:
The momentum term accumulates a velocity vector in directions of persistent gradient. This helps the optimizer move faster along shallow ravines (consistent gradient direction) and dampens oscillations across steep directions (where the gradient sign flips), leading to faster and more stable convergence.
25. The RMSProp optimization algorithm was primarily designed to solve which specific problem encountered in the Adagrad algorithm?
RMSProp and Adam
Medium
A.The tendency to get stuck in saddle points.
B.The aggressive and monotonically decreasing learning rate.
C.The inability to handle sparse data effectively.
D.The high computational cost of calculating second-order derivatives.
Correct Answer: The aggressive and monotonically decreasing learning rate.
Explanation:
Adagrad adapts the learning rate for each parameter but accumulates all past squared gradients in the denominator. This causes the learning rate to shrink monotonically and eventually become too small, effectively stopping learning. RMSProp fixes this by using an exponentially decaying average of squared gradients, preventing the learning rate from vanishing.
26. Which of the following functions is non-convex?
Convex sets and convex functions
Medium
A.sin(x) on the interval [0, 2π]
B.||x||₂ (the L2 norm)
C.
D.
Correct Answer: sin(x) on the interval [0, 2π]
Explanation:
A function is convex if the line segment between any two points on its graph lies on or above the graph. The sine function oscillates, and a line segment connecting two points on its graph (e.g., at x = 0 and x = 2π) will pass below the graph, violating the definition of convexity. The other functions are classic examples of convex functions.
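A quick numeric way to expose non-convexity is the midpoint test used implicitly in the explanation above. The chosen endpoints are illustrative.

```python
import math

# Midpoint convexity check: a convex f satisfies
# f((a + b) / 2) <= (f(a) + f(b)) / 2 for every pair a, b.
# Finding any pair that violates this proves f is not convex.
def violates_midpoint_convexity(f, a, b):
    return f((a + b) / 2) > (f(a) + f(b)) / 2

# sin on [0, 2*pi]: the chord from (0, 0) to (2*pi, 0) lies below the arch at pi/2.
sine_violates = violates_midpoint_convexity(math.sin, 0.0, math.pi)
# x^2 is convex, so no pair of points can violate the inequality.
square_violates = violates_midpoint_convexity(lambda x: x * x, -3.0, 5.0)
```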
27. For a function , what is the directional derivative at the point in the direction of the vector ?
Gradients and directional derivatives
Medium
A.18/5
B.14/5
C.18
D.14
Correct Answer: 18/5
Explanation:
The directional derivative of f at a point p in the direction of a vector v is D_v f(p) = ∇f(p) · v/‖v‖: evaluate the gradient at p, then take its dot product with the unit vector along v. Applying this to the given function, point, and direction yields 18/5.
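The gradient-dot-unit-vector recipe can be sketched as follows. Since the question's original function was not preserved, the function f(x, y) = x·y² below is a hypothetical stand-in, and the point and direction are illustrative.

```python
import math

# Directional derivative: D_v f(p) = grad f(p) . v / ||v||.
def directional_derivative(grad_at_p, v):
    norm = math.sqrt(sum(c * c for c in v))
    return sum(g * c for g, c in zip(grad_at_p, v)) / norm

# Stand-in example: f(x, y) = x*y^2, so grad f = (y^2, 2*x*y).
# At p = (1, 2), grad f(p) = (4, 4); direction v = (3, 4) has norm 5.
d = directional_derivative((4.0, 4.0), (3.0, 4.0))  # (12 + 16) / 5 = 5.6
```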
28. Which statement best describes the advantage of Mini-batch Gradient Descent over both Stochastic Gradient Descent (SGD) and Batch Gradient Descent (BGD)?
Gradient descent and variants (batch, stochastic, mini-batch)
Medium
A.It balances the stability of BGD with the efficiency of SGD by using a small subset of data for each update.
B.It uses the full dataset for each update, but in a vectorized and more efficient manner than BGD.
C.It converges to the global minimum faster than both SGD and BGD on all types of problems.
D.It does not require tuning a learning rate, unlike SGD and BGD.
Correct Answer: It balances the stability of BGD with the efficiency of SGD by using a small subset of data for each update.
Explanation:
Mini-batch GD offers a compromise. By using a small batch of data (e.g., 32-512 samples), it reduces the variance of the updates compared to SGD (making convergence more stable) and is computationally more efficient than BGD on large datasets. It also allows for hardware-level vectorization benefits.
29. The Adam optimizer can be viewed as a combination of which two other optimization algorithms?
RMSProp and Adam
Medium
A.Momentum and RMSProp
B.Newton's method and Batch Gradient Descent
C.L-BFGS and RMSProp
D.Adagrad and Nesterov Momentum
Correct Answer: Momentum and RMSProp
Explanation:
Adam (Adaptive Moment Estimation) combines the ideas of Momentum and RMSProp. It uses an exponentially decaying average of past gradients (like Momentum) to estimate the first moment (velocity), and an exponentially decaying average of past squared gradients (like RMSProp) to estimate the second moment (adaptive learning rate).
30. In a large-scale distributed training system using data parallelism, what is the typical procedure for a single training step?
Optimization considerations in large-scale ML systems
Medium
A.The dataset is split, and each worker trains a completely independent model on its subset, which are later averaged.
B.The model itself is split across multiple workers, and each worker is responsible for computing a part of the forward and backward pass.
C.The model is replicated on multiple workers, each processes a different batch of data, and their resulting gradients are aggregated to update a central model.
D.A central server sends one data point at a time to each worker, waits for the gradient, and updates the model before sending the next point.
Correct Answer: The model is replicated on multiple workers, each processes a different batch of data, and their resulting gradients are aggregated to update a central model.
Explanation:
This describes synchronous data parallelism. The model is copied to each worker. Each worker computes gradients on its own mini-batch of data. These gradients are then aggregated (e.g., averaged) to compute a single update, which is applied to the central model. The updated model is then re-distributed to the workers.
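The aggregate-then-update cycle of synchronous data parallelism can be sketched in-process, with each "worker" simulated as a function call on its own data shard. The loss, shard contents, and learning rate are illustrative assumptions; a real system would run the workers on separate devices and aggregate with a collective operation such as all-reduce.

```python
# Simulated synchronous data parallelism: one model replica per worker,
# gradients averaged into a single update of the shared parameter w.
def worker_gradient(w, batch):
    # gradient of the mean squared error (w - x)^2 over the worker's batch
    return sum(2.0 * (w - x) for x in batch) / len(batch)

w, lr = 0.0, 0.5
shards = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]    # one mini-batch per worker
for _ in range(50):
    grads = [worker_gradient(w, shard) for shard in shards]  # parallel in practice
    avg_grad = sum(grads) / len(grads)           # aggregation (e.g., all-reduce)
    w -= lr * avg_grad                           # single update to the central model
```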
31. At a given point on the surface of a loss function L, the negative gradient vector, −∇L, points in which direction?
Gradients and directional derivatives
Medium
A.The direction of steepest descent.
B.A direction tangent to the contour line.
C.The direction of steepest ascent.
D.The direction directly towards the global minimum.
Correct Answer: The direction of steepest descent.
Explanation:
By definition, the gradient vector ∇L points in the direction of the steepest ascent of the function L. Therefore, the negative gradient, −∇L, points in the direction of the steepest descent, which is precisely the direction gradient descent algorithms follow to minimize the function.
32. In the context of a machine learning optimization problem, what is the primary purpose of adding a regularization term (e.g., L1 or L2) to the loss function?
Optimization problem formulation in ML
Medium
A.To speed up the convergence of gradient descent.
B.To ensure the loss function is always convex.
C.To penalize model complexity and prevent overfitting.
D.To increase the model's accuracy on the training data.
Correct Answer: To penalize model complexity and prevent overfitting.
Explanation:
Regularization terms add a penalty based on the magnitude of the model's parameters to the loss function. This discourages the model from learning overly complex patterns that fit the training data's noise, thus improving its ability to generalize to new, unseen data (i.e., preventing overfitting).
33. How does Nesterov Accelerated Gradient (NAG) differ fundamentally from standard Momentum?
Momentum-based optimization
Medium
A.NAG uses a fixed momentum coefficient of 0.9, while standard momentum requires tuning.
B.NAG is a second-order optimization method, whereas standard momentum is a first-order method.
C.NAG calculates the gradient at a 'lookahead' position, after applying the current velocity, rather than at the current position.
D.NAG uses a decaying average of squared gradients instead of a simple velocity term.
Correct Answer: NAG calculates the gradient at a 'lookahead' position, after applying the current velocity, rather than at the current position.
Explanation:
Standard momentum first calculates the gradient at the current position and then updates the velocity. NAG is smarter; it first makes a temporary jump in the direction of the current velocity (a 'lookahead' point) and then calculates the gradient at that new point to make a more informed correction. This helps it slow down more effectively when approaching a minimum.
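The lookahead mechanic can be sketched against classical momentum. The toy objective f(w) = w², the hyperparameters, and the particular sign convention used for the lookahead point are illustrative assumptions; NAG formulations in the literature differ in where the momentum factor appears.

```python
# Nesterov-style step (sketch): the gradient is evaluated at a lookahead
# point, not at the current position as in classical momentum.
def nag_step(w, v, grad_fn, lr=0.1, beta=0.9):
    lookahead = w - lr * beta * v       # provisional jump along the current velocity
    v = beta * v + grad_fn(lookahead)   # gradient taken at the lookahead point
    w = w - lr * v
    return w, v

# On f(w) = w^2 (gradient 2w), the iterate is driven toward the minimum at 0.
w, v = 5.0, 0.0
for _ in range(300):
    w, v = nag_step(w, v, lambda x: 2.0 * x)
```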
34. You are observing the training loss curve for a model. The loss decreases overall, but it is highly erratic and noisy from one iteration to the next. Which optimization variant is most likely being used?
Gradient descent and variants (batch, stochastic, mini-batch)
Medium
A.L-BFGS
B.Newton's Method
C.Batch Gradient Descent (BGD)
D.Stochastic Gradient Descent (SGD)
Correct Answer: Stochastic Gradient Descent (SGD)
Explanation:
The high variance in the loss curve is a classic characteristic of SGD. Because each update is based on a single, randomly chosen data point, the gradient estimate is noisy, causing the loss to fluctuate significantly between iterations even as the overall trend is downward. BGD would produce a much smoother curve.
35. For a convex function f, Jensen's inequality states that for any set of points x₁, …, xₙ in its domain and any non-negative weights λ₁, …, λₙ that sum to 1, which of the following is true?
Convex sets and convex functions
Medium
A.f(Σᵢ λᵢxᵢ) ≥ Σᵢ λᵢf(xᵢ)
B.f(Σᵢ λᵢxᵢ) = Σᵢ λᵢf(xᵢ)
C.f(Σᵢ λᵢxᵢ) ≤ Σᵢ λᵢf(xᵢ)
D.f(Σᵢ λᵢxᵢ) < Σᵢ λᵢf(xᵢ) strictly
Correct Answer: f(Σᵢ λᵢxᵢ) ≤ Σᵢ λᵢf(xᵢ)
Explanation:
Jensen's inequality is a fundamental property of convex functions. It states that the function evaluated at a weighted average of points is less than or equal to the weighted average of the function's values at those points. Geometrically, this means the chord connecting two points on the graph of a convex function lies on or above the graph.
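Jensen's inequality can be verified numerically for a concrete convex function. The choice of f(x) = eˣ, the points, and the weights below are illustrative.

```python
import math

# Numeric check of Jensen's inequality for the convex function f(x) = exp(x):
# f(sum(l_i * x_i)) <= sum(l_i * f(x_i)) for non-negative weights summing to 1.
xs = [0.0, 1.0, 2.0]
weights = [0.2, 0.5, 0.3]           # non-negative, sum to 1

lhs = math.exp(sum(l * x for l, x in zip(weights, xs)))   # f of the weighted average
rhs = sum(l * math.exp(x) for l, x in zip(weights, xs))   # weighted average of f
# For a convex f the left side never exceeds the right side.
```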
36. In the Adam optimizer, what is the purpose of the bias correction step for the first and second moment estimates (m_t and v_t)?
RMSProp and Adam
Medium
A.To counteract the fact that the moment estimates are initialized at zero and are therefore biased towards zero, especially during initial steps.
B.To ensure the learning rate remains positive throughout training.
C.To normalize the gradients to prevent them from becoming too large (exploding gradients).
D.To add a momentum-like term to the update rule.
Correct Answer: To counteract the fact that the moment estimates are initialized at zero and are therefore biased towards zero, especially during initial steps.
Explanation:
The moment estimates (m_t and v_t) are exponentially moving averages initialized to zero. In the early stages of training, they are biased towards zero. The bias correction step divides them by (1 − β₁ᵗ) and (1 − β₂ᵗ) respectively, which scales them up to counteract this initial bias, leading to larger and more accurate updates early in training.
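The bias-corrected Adam step can be sketched as follows; the hyperparameters are the commonly used defaults, taken here as illustrative assumptions. Note how at t = 1 the corrections exactly undo the zero initialization of both moments.

```python
import math

# One Adam parameter update with bias correction (sketch).
def adam_step(w, m, v, g, t, lr=0.001, b1=0.9, b2=0.999, eps=1e-8):
    m = b1 * m + (1 - b1) * g        # first moment: momentum-like average of gradients
    v = b2 * v + (1 - b2) * g * g    # second moment: RMSProp-like average of squared gradients
    m_hat = m / (1 - b1 ** t)        # bias correction: m was initialized at zero
    v_hat = v / (1 - b2 ** t)        # bias correction: v was initialized at zero
    w = w - lr * m_hat / (math.sqrt(v_hat) + eps)
    return w, m, v

# First step (t = 1): m_hat recovers g exactly and v_hat recovers g^2 exactly,
# so the update magnitude is approximately lr regardless of the gradient scale.
w, m, v = adam_step(0.0, 0.0, 0.0, g=10.0, t=1)
```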
37. If the gradient of a non-convex, differentiable function f is zero at a point x₀, i.e., ∇f(x₀) = 0, what can be said about x₀?
Gradients and directional derivatives
Medium
A.x₀ must be a saddle point.
B.x₀ must be a local maximum.
C.x₀ must be a global minimum.
D.x₀ is a stationary point, which could be a local minimum, local maximum, or a saddle point.
Correct Answer: x₀ is a stationary point, which could be a local minimum, local maximum, or a saddle point.
Explanation:
A point where the gradient is zero is called a stationary point. For a general non-convex function, a stationary point can be a local minimum (a valley), a local maximum (a hill), or a saddle point. Without second-derivative information (the Hessian), we cannot distinguish between these possibilities.
38. When would a machine learning engineer most likely choose model parallelism over data parallelism for training a large neural network?
Optimization considerations in large-scale ML systems
Medium
A.When the dataset is extremely large but the model is relatively small.
B.When the model is too large to fit into the memory of a single GPU/accelerator.
C.When training on a single machine with multiple cores.
D.When they want to reduce the noise in gradient updates.
Correct Answer: When the model is too large to fit into the memory of a single GPU/accelerator.
Explanation:
Model parallelism involves splitting the model itself across multiple devices. This approach is necessary when a single model's parameters, activations, and gradients are so large that they exceed the available memory (e.g., GPU VRAM) of a single device. Data parallelism is used when the model fits on one device but the dataset is very large.
39. In momentum-based gradient descent, what is a potential downside of setting the momentum coefficient (e.g., β) very close to 1 (e.g., 0.999)?
Momentum-based optimization
Medium
A.The update rule will effectively become equivalent to standard SGD.
B.The optimizer may overshoot the minimum and struggle to stop, especially if the learning rate is not decreased.
C.The optimizer will converge much more slowly as it will heavily dampen all updates.
D.The memory required for training will increase quadratically.
Correct Answer: The optimizer may overshoot the minimum and struggle to stop, especially if the learning rate is not decreased.
Explanation:
A high momentum coefficient means the velocity vector places a very heavy emphasis on the previous direction. This can cause the optimizer to build up too much speed and 'overshoot' a minimum, oscillating back and forth across a valley instead of settling into it. It makes the optimizer less responsive to sudden changes in the loss landscape.
40. You are training a model on a dataset with highly redundant data (e.g., many nearly identical images). Which optimization variant would likely offer the most significant computational speedup over Batch Gradient Descent without a major loss in convergence quality?
Gradient descent and variants (batch, stochastic, mini-batch)
Medium
A.Conjugate Gradient.
B.L-BFGS.
C.Newton's Method.
D.Mini-batch or Stochastic Gradient Descent.
Correct Answer: Mini-batch or Stochastic Gradient Descent.
Explanation:
When data is highly redundant, the gradient computed from a small batch or even a single sample is a good approximation of the gradient of the entire dataset. Therefore, performing many cheap updates using Mini-batch GD or SGD will lead to much faster convergence in terms of wall-clock time compared to one expensive BGD update, which needlessly re-processes similar information.
41. Consider the Support Vector Machine (SVM) optimization problem with a soft margin. Which of the following statements about its dual formulation is most accurate?
Optimization problem formulation in ML
Hard
A.The dual problem involves Lagrange multipliers that are constrained to be negative and sum to one.
B.The dual problem is an unconstrained quadratic programming problem whose dimensionality depends on the number of features.
C.The dual problem's objective function is maximized, and its dimensionality is determined by the number of training samples, making it suitable for high-dimensional feature spaces.
D.The dual problem is computationally equivalent to the primal for all kernel types, offering no specific advantage.
Correct Answer: The dual problem's objective function is maximized, and its dimensionality is determined by the number of training samples, making it suitable for high-dimensional feature spaces.
Explanation:
The dual SVM problem maximizes an objective function with respect to the Lagrange multipliers αᵢ (one for each data point). Its dimensionality is n (the number of samples), not d (the number of features). This is advantageous when d ≫ n, especially with the kernel trick, as the computation depends on the dot products of feature vectors, not the vectors themselves. The Lagrange multipliers are constrained by 0 ≤ αᵢ ≤ C and Σᵢ αᵢyᵢ = 0.
42. Let f: ℝⁿ → ℝ be a function. The epigraph of f, denoted epi(f), is a subset of ℝⁿ⁺¹. What is the most precise relationship between the convexity of the function f and the convexity of its epigraph epi(f)?
Convex sets and convex functions
Hard
A.If epi(f) is a convex set, then f must be a convex function, but the converse is not always true.
B.The convexity of f is completely independent of the convexity of epi(f).
C.f is a convex function if and only if epi(f) is a convex set.
D.If f is a convex function, then epi(f) must be a convex set, but the converse is not always true.
Correct Answer: f is a convex function if and only if epi(f) is a convex set.
Explanation:
A function f is defined as convex if its epigraph, which is the set of points lying on or above its graph, i.e., epi(f) = {(x, t) : t ≥ f(x)}, is a convex set. This is a fundamental definition and the relationship is 'if and only if'.
43. Consider the function . What is the subgradient at the point ?
Gradients and directional derivatives
Hard
A.The set of all vectors such that where .
B.A unique vector .
C.The empty set, as the gradient does not exist.
D.The set of all vectors such that where .
Correct Answer: The set of all vectors such that where .
Explanation:
For a component of the function that is differentiable at the point, the corresponding entry of the subgradient is just the usual partial derivative. For a component that is non-differentiable at the point (such as an absolute value at 0), the subdifferential is the interval [−1, 1]. Therefore, the subgradient at the point is the set of vectors whose differentiable entries are fixed and whose non-differentiable entry can be any value in [−1, 1].
44. For a function f that is L-smooth but not necessarily convex, which statement about the convergence of Batch Gradient Descent (BGD) with a fixed learning rate is most accurate?
Gradient descent and variants (batch, stochastic, mini-batch)
Hard
A.BGD is guaranteed to converge to a global minimum if η ≤ 1/L.
B.BGD is guaranteed to converge to a local minimum for any η > 0.
C.BGD may diverge, oscillate, or converge to a stationary point, but it is not guaranteed to converge to even a local minimum.
D.BGD is guaranteed to converge to a stationary point (where ∇f(x) = 0) if η ≤ 1/L.
Correct Answer: BGD is guaranteed to converge to a stationary point (where ∇f(x) = 0) if η ≤ 1/L.
Explanation:
For an L-smooth function (meaning its gradient is Lipschitz continuous with constant L), Batch Gradient Descent with a fixed learning rate η ≤ 1/L is guaranteed to converge to a stationary point (a point where the gradient is zero). This point could be a local minimum, a local maximum, or a saddle point. Convergence to a global minimum is only guaranteed if the function is also convex.
Incorrect! Try again.
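The guarantee is easy to see numerically. A minimal sketch, assuming a made-up non-convex 1-D function; the starting point and step size are illustrative choices, with η small enough that η ≤ 1/L holds on the region the iterates visit:

```python
# Non-convex example f(x) = x^4 - 3x^2 + x, with gradient f'(x) = 4x^3 - 6x + 1.
# The gradient is Lipschitz on any bounded region; eta = 0.01 is well below 1/L there.
def grad(x):
    return 4 * x**3 - 6 * x + 1

x = 2.0          # starting point (made up)
eta = 0.01       # fixed learning rate, assumed <= 1/L on the iterates' region
for _ in range(5000):
    x -= eta * grad(x)

# BGD reaches *a* stationary point (gradient ~ 0) -- here a local minimum
# near x = 1.13 -- not necessarily the global minimum of f.
print(x, grad(x))
```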
45How does Nesterov Accelerated Gradient (NAG) differ fundamentally from classical momentum in its update rule?
Momentum-based optimization
Hard
A.NAG calculates the gradient and the momentum update completely independently and combines them with a weighted average.
B.NAG first takes a step in the direction of the accumulated momentum, computes the gradient at this "lookahead" position, and then makes a correction.
C.NAG uses a larger momentum term (β) than classical momentum, leading to faster acceleration.
D.NAG computes the gradient at the current position, then takes a large step in the direction of the accumulated momentum.
Correct Answer: NAG first takes a step in the direction of the accumulated momentum, computes the gradient at this "lookahead" position, and then makes a correction.
Explanation:
Classical momentum calculates the gradient at the current position θ_t and uses it to update the velocity vector. NAG introduces a 'lookahead' step: it first moves temporarily to θ_t + βv_t, calculates the gradient at this lookahead position, and then uses this gradient to compute the final update. This 'lookahead and correct' mechanism allows it to slow down more effectively when approaching a minimum.
Incorrect! Try again.
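The two update rules differ in a single line: where the gradient is evaluated. A minimal sketch on a toy quadratic (the quadratic, η, and β are illustrative assumptions, not from the question):

```python
# Toy objective f(w) = 0.5 * w**2, so grad(w) = w.
def grad(w):
    return w

eta, beta = 0.1, 0.9

# Classical momentum: gradient evaluated at the CURRENT point w.
w, v = 5.0, 0.0
for _ in range(200):
    v = beta * v - eta * grad(w)
    w = w + v

# Nesterov: gradient evaluated at the LOOKAHEAD point w + beta*v.
w_nag, v_nag = 5.0, 0.0
for _ in range(200):
    v_nag = beta * v_nag - eta * grad(w_nag + beta * v_nag)
    w_nag = w_nag + v_nag

print(abs(w), abs(w_nag))  # both near 0; NAG's lookahead damps the overshoot
```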
46In the Adam optimization algorithm, what is the primary purpose of the bias correction step for the first and second moment estimates (m_t and v_t)?
RMSProp and Adam
Hard
A.To initialize the moment estimates to non-zero values, avoiding division by zero.
B.To prevent the learning rate from becoming too large during the initial stages of training.
C.To counteract the tendency of the moment estimates to be biased towards zero, especially during the initial timesteps.
D.To normalize the gradients, ensuring they have a unit norm before being used in the update.
Correct Answer: To counteract the tendency of the moment estimates to be biased towards zero, especially during the initial timesteps.
Explanation:
The first and second moment estimates, m_t and v_t, are calculated as exponential moving averages. Since they are initialized to zero, they are biased towards zero, particularly during early training. The bias correction step, m̂_t = m_t / (1 − β₁ᵗ) and v̂_t = v_t / (1 − β₂ᵗ), scales them up to counteract this initialization bias, making the initial updates more accurate.
Incorrect! Try again.
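The effect of the correction can be seen with a constant gradient stream (the stream is made up; β₁ and β₂ follow the common defaults, and this sketch shows only the moment bookkeeping, not Adam's full parameter update):

```python
# Bias correction in Adam's moment estimates, with a constant gradient g = 1.
beta1, beta2 = 0.9, 0.999
m, v = 0.0, 0.0   # zero-initialized moments (the source of the bias)
g = 1.0

for t in range(1, 4):
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g * g
    m_hat = m / (1 - beta1**t)   # bias-corrected first moment
    v_hat = v / (1 - beta2**t)   # bias-corrected second moment
    print(t, round(m, 4), round(m_hat, 4))

# At t=1 the raw m is only 0.1 (pulled toward its zero init),
# but m_hat recovers the true mean gradient of 1.0 exactly.
```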
47In a large-scale distributed training system using asynchronous stochastic gradient descent (Async-SGD), what is the primary challenge that can impede convergence compared to its synchronous counterpart?
Optimization considerations in large-scale ML systems
Hard
A."Stale gradients," where a worker updates the central model using a gradient computed based on parameters that are already outdated.
B.Increased memory overhead on the parameter server to store multiple versions of the model.
C.Network latency causing some workers to become completely idle.
D.The requirement for a high-bandwidth connection between the parameter server and workers.
Correct Answer: "Stale gradients," where a worker updates the central model using a gradient computed based on parameters that are already outdated.
Explanation:
In Async-SGD, workers compute gradients and update a central parameter server without waiting for other workers. This means an update might be based on a 'stale' version of the model parameters, as other workers may have updated them in the meantime. This staleness introduces noise and variance into the optimization process, which can slow down or destabilize convergence.
Incorrect! Try again.
48Let f and g be two convex functions defined on ℝⁿ. Which of the following functions is NOT guaranteed to be convex?
Convex sets and convex functions
Hard
A.a·f(x) + b·g(x) for non-negative scalars a, b.
B.f(x)·g(x)
C.max(f(x), g(x))
D.h(f(x)), where h is a convex function that is also non-decreasing.
Correct Answer: f(x)·g(x)
Explanation:
The product of two convex functions is not generally convex. For example, let f(x) = (x − 1)² and g(x) = (x + 1)². Both are convex, but their product f(x)g(x) = (x² − 1)² is a non-convex function with multiple local minima. Non-negative weighted sums (A), pointwise maximum (C), and composition with a non-decreasing convex outer function (D) are all operations that preserve convexity.
Incorrect! Try again.
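The non-convexity of such a product can be checked numerically; a minimal sketch using the convex pair f(x) = (x − 1)² and g(x) = (x + 1)², whose product has two separate minima:

```python
# f(x) = (x-1)^2 and g(x) = (x+1)^2 are each convex,
# but their product p(x) = (x^2 - 1)^2 is not.
p = lambda x: (x**2 - 1) ** 2

# Convexity would require p(midpoint) <= average of the endpoint values;
# the chord from x = -1 to x = 1 violates this: p(0) = 1 > (p(-1)+p(1))/2 = 0.
lhs = p(0.0)
rhs = 0.5 * (p(-1.0) + p(1.0))
print(lhs, rhs)  # 1.0 0.0 -- the midpoint lies ABOVE the chord
```

The two minima at x = ±1 with a local maximum at x = 0 are exactly the "multiple local minima" a convex function cannot have.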
49In Stochastic Gradient Descent (SGD), how does the variance of the gradient estimate typically behave as the optimizer approaches a local minimum?
Gradient descent and variants (batch, stochastic, mini-batch)
Hard
A.The variance remains constant regardless of the proximity to the minimum.
B.The variance increases exponentially as the optimizer gets closer to the minimum.
C.The variance decreases to zero as the true gradient approaches zero.
D.The variance does not approach zero, which prevents SGD with a fixed learning rate from converging to the exact minimum.
Correct Answer: The variance does not approach zero, which prevents SGD with a fixed learning rate from converging to the exact minimum.
Explanation:
The variance of the stochastic gradient does not necessarily go to zero even when the full batch gradient approaches zero at a minimum. This persistent variance causes SGD with a fixed learning rate to 'bounce around' the minimum rather than converging to the exact point. To achieve convergence to the minimum, the learning rate must be annealed (decreased) over time.
Incorrect! Try again.
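A small simulation illustrates the point. The additive noise below models per-sample gradient variance that persists at the minimum; the objective, noise level, and step size are made-up illustrative values:

```python
import random

# Minimize f(w) = 0.5*(w - 1)^2 with noisy gradient estimates
# g = (w - 1) + noise, where the noise does NOT vanish at the minimum.
random.seed(0)
w, eta = 5.0, 0.1
tail = []
for t in range(2000):
    g = (w - 1.0) + random.gauss(0.0, 1.0)  # stochastic gradient
    w -= eta * g
    if t >= 1000:
        tail.append(w)   # record the late iterates

mean_w = sum(tail) / len(tail)
spread = max(tail) - min(tail)
print(round(mean_w, 2), round(spread, 2))
# w hovers around the minimum at 1.0 but keeps fluctuating:
# with a fixed eta the iterates never settle to the exact point.
```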
50Despite its general effectiveness, Adam has been observed to fail to converge on certain simple, convex optimization problems where standard SGD with momentum converges. What is a key theoretical reason for this phenomenon?
RMSProp and Adam
Hard
A.The algorithm can be overly sensitive to the choice of the epsilon (ε) parameter.
B.The adaptive learning rate can become excessively small in directions with historically large but currently informative gradients.
C.The second moment estimate v_t can cause the effective learning rate to be dominated by past, irrelevant gradient information, leading to convergence to a suboptimal point.
D.The use of bias correction causes instability in later stages of training.
Correct Answer: The second moment estimate v_t can cause the effective learning rate to be dominated by past, irrelevant gradient information, leading to convergence to a suboptimal point.
Explanation:
Research has shown that Adam's long-term memory of squared gradients (v_t) can be problematic. If large gradients are seen early in training, v_t becomes large, shrinking the effective learning rate. Even if later gradients are small but consistently point to the optimum, the optimizer may fail to make progress because the historical information in v_t keeps the learning rate too small.
Incorrect! Try again.
51When formulating a linear regression problem, we can add an L1 (Lasso) or L2 (Ridge) regularization term. From an optimization landscape perspective, what is the most significant difference between the objective functions created by these two regularizers?
Optimization problem formulation in ML
Hard
A.The L2-regularized objective is smooth and strictly convex, while the L1-regularized objective is non-smooth but still strictly convex.
B.The L1-regularized objective function is non-differentiable at points where any coefficient is zero, leading to sparse solutions, while the L2 objective is smooth everywhere.
C.The L2-regularized objective has a unique global minimum, whereas the L1-regularized objective can have multiple global minima.
D.The L1 regularizer penalizes large coefficients more heavily than the L2 regularizer, preventing any single weight from dominating.
Correct Answer: The L1-regularized objective function is non-differentiable at points where any coefficient is zero, leading to sparse solutions, while the L2 objective is smooth everywhere.
Explanation:
The L1 norm, ||w||₁ = Σᵢ|wᵢ|, creates 'kinks' in the loss surface wherever any weight wᵢ = 0, making it non-differentiable there. Optimizers are often driven into these kinks, which results in sparse solutions (weights that are exactly zero). The L2 norm, ||w||₂² = Σᵢwᵢ², is a smooth quadratic function, resulting in a smooth loss surface that encourages small, but typically non-zero, weights. This difference in differentiability is the key.
Incorrect! Try again.
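One concrete way to see the difference is through the corresponding shrinkage steps used by proximal methods (a sketch; the helper names and the weight vector are illustrative, not from the question):

```python
import numpy as np

# Soft-thresholding is the proximal step for the L1 penalty: it sets small
# coordinates EXACTLY to zero. The comparable L2 (ridge) step only scales
# coordinates down, never zeroing them.
def prox_l1(w, lam):
    return np.sign(w) * np.maximum(np.abs(w) - lam, 0.0)

def shrink_l2(w, lam):
    return w / (1.0 + lam)

w = np.array([3.0, 0.4, -0.2, -2.5])
sparse = prox_l1(w, 0.5)
dense = shrink_l2(w, 0.5)
print(sparse)  # the two small entries become exactly 0
print(dense)   # every entry shrinks but stays non-zero
```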
52Consider the function f(x, y) = max(x, y). What is the directional derivative of f at the point p = (1, 1) in the direction u = (-1, -1)?
Gradients and directional derivatives
Hard
A.0
B.The directional derivative does not exist.
C.-4
D.-1
Correct Answer: -1
Explanation:
The point p = (1, 1) lies on the non-differentiable seam where x = y. We use the definition: D_u f(p) = lim_{h→0⁺} (f(p + hu) − f(p)) / h. Here, p = (1, 1) and u = (−1, −1). The function value is f(p) = max(1, 1) = 1. For small h > 0, f(p + hu) = max(1 − h, 1 − h) = 1 − h. Thus, the difference quotient is ((1 − h) − 1)/h = −1. The limit is therefore −1.
Incorrect! Try again.
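Limits of this kind can be verified numerically straight from the definition; a minimal sketch, assuming the seam example f(x, y) = max(x, y) at p = (1, 1) with direction u = (−1, −1):

```python
# Numeric check of D_u f(p) = lim_{h->0+} (f(p + h*u) - f(p)) / h.
f = lambda x, y: max(x, y)
p, u = (1.0, 1.0), (-1.0, -1.0)

# Powers of two keep the floating-point arithmetic exact here.
for h in (0.5, 0.25, 0.125, 0.0625):
    quotient = (f(p[0] + h * u[0], p[1] + h * u[1]) - f(*p)) / h
    print(h, quotient)  # the one-sided quotient equals -1 for every h
```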
53You are training a neural network and observe that the training loss curve shows large, sustained oscillations that slowly decrease in amplitude. Which hyperparameter adjustment is most likely to mitigate this specific behavior?
Momentum-based optimization
Hard
A.Decreasing the learning rate (η) and increasing the momentum coefficient (β) simultaneously.
B.Decreasing the momentum coefficient (β) to reduce the overshoot causing the oscillations.
C.Increasing the batch size to get a more accurate gradient estimate.
D.Increasing the learning rate (η) to escape the oscillatory pattern.
Correct Answer: Decreasing the momentum coefficient (β) to reduce the overshoot causing the oscillations.
Explanation:
Large, sustained oscillations are a classic sign of the momentum term being too high. The optimizer overshoots the minimum in a valley of the loss landscape, and the accumulated momentum carries it too far up the other side. Reducing the momentum coefficient (e.g., from 0.99 to 0.9) dampens this behavior by placing more weight on the current gradient and less on the accumulated history, allowing the optimizer to settle.
Incorrect! Try again.
54Let f be a convex function and X be a random variable. According to Jensen's inequality, what is the relationship between E[f(X)] and f(E[X])? If f is strictly convex and X is not a constant, what can be further concluded?
Convex sets and convex functions
Hard
A.E[f(X)] ≥ f(E[X]), with strict inequality whenever f is strictly convex.
B.E[f(X)] ≤ f(E[X]), with inequality only if the function f is non-linear.
C.E[f(X)] ≥ f(E[X]), with equality holding if and only if f is a linear function.
D.E[f(X)] ≥ f(E[X]), with strict inequality if f is strictly convex and X is not a constant.
Correct Answer: E[f(X)] ≥ f(E[X]), with strict inequality if f is strictly convex and X is not a constant.
Explanation:
Jensen's inequality states that for a convex function f, E[f(X)] ≥ f(E[X]). This means the expectation of the function is greater than or equal to the function of the expectation. If the function is strictly convex and the random variable has non-zero variance (is not a constant), then the inequality becomes strict: E[f(X)] > f(E[X]).
Incorrect! Try again.
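A quick Monte Carlo check, assuming the strictly convex f(x) = x² and X uniform on (0, 1) (illustrative choices):

```python
import random

# For f(x) = x^2 and non-constant X ~ Uniform(0, 1):
# E[f(X)] = 1/3 strictly exceeds f(E[X]) = (1/2)^2 = 1/4,
# and the gap equals Var(X) = 1/12.
random.seed(0)
xs = [random.random() for _ in range(100_000)]

e_f_x = sum(x * x for x in xs) / len(xs)  # estimates E[X^2] = 1/3
f_e_x = (sum(xs) / len(xs)) ** 2          # estimates (E[X])^2 = 1/4
print(round(e_f_x, 3), round(f_e_x, 3))
```

For f(x) = x² the gap E[f(X)] − f(E[X]) is exactly Var(X), which is why the inequality is strict whenever X is not constant.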
55Consider an objective function with a long, narrow, "ravine-like" valley. The gradient is large perpendicular to the ravine and small along it. How would Batch (BGD), Stochastic (SGD), and Mini-batch SGD likely perform?
Gradient descent and variants (batch, stochastic, mini-batch)
Hard
A.Mini-batch SGD would be the most inefficient due to the combined problems of zigzagging and gradient noise.
B.All three methods would perform identically, as the underlying gradient direction is the same on average.
C.BGD would zigzag inefficiently across the ravine; SGD's noise could help it move faster along the ravine; Mini-batch would offer a balance.
D.SGD would be the slowest due to its high variance, while BGD would move directly down the ravine.
Correct Answer: BGD would zigzag inefficiently across the ravine; SGD's noise could help it move faster along the ravine; Mini-batch would offer a balance.
Explanation:
BGD uses the true gradient, which points steeply across the ravine, causing it to zigzag inefficiently. SGD's high-variance updates, while also causing zigzagging, can have a random component along the ravine's axis, sometimes leading to faster progress in this specific pathology. Mini-batch SGD reduces the variance compared to SGD (less zigzagging) while being more efficient than BGD, offering a good compromise.
Incorrect! Try again.
56RMSProp was developed to address a major drawback of the AdaGrad algorithm. What is this critical flaw in AdaGrad that RMSProp rectifies?
RMSProp and Adam
Hard
A.AdaGrad uses a momentum term that can cause it to overshoot minima.
B.AdaGrad requires manual tuning of a global learning rate, which RMSProp automates completely.
C.AdaGrad's learning rate can sometimes increase, leading to instability, which RMSProp prevents.
D.AdaGrad's learning rate aggressively and monotonically decreases, often becoming infinitesimally small and prematurely stopping learning.
Correct Answer: AdaGrad's learning rate aggressively and monotonically decreases, often becoming infinitesimally small and prematurely stopping learning.
Explanation:
AdaGrad scales the learning rate by the square root of the sum of all past squared gradients. This sum grows continuously, causing the learning rate to monotonically decrease. For long training runs, the rate can become so small that learning stops. RMSProp fixes this by using an exponentially decaying moving average of squared gradients instead, allowing the optimizer to 'forget' the distant past and prevent the learning rate from vanishing.
Incorrect! Try again.
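The contrast between the two accumulators can be sketched in a few lines (the decay rate ρ and the constant unit-gradient stream are illustrative assumptions):

```python
# AdaGrad's accumulator grows without bound over a long run of gradients;
# RMSProp's exponential moving average saturates instead.
eps, rho = 1e-8, 0.9
g = 1.0  # pretend every step sees the same unit gradient

adagrad_acc, rms_acc = 0.0, 0.0
for t in range(1, 10001):
    adagrad_acc += g * g                          # sum of ALL past squared grads
    rms_acc = rho * rms_acc + (1 - rho) * g * g   # decaying average

adagrad_scale = 1.0 / (adagrad_acc ** 0.5 + eps)  # effective step multiplier
rms_scale = 1.0 / (rms_acc ** 0.5 + eps)

print(adagrad_scale, rms_scale)
# AdaGrad's multiplier has shrunk to 1/sqrt(10000) = 0.01 and keeps shrinking;
# RMSProp's stays near 1 because its average converged to g^2 = 1.
```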
57In a data-parallel, synchronous distributed training setup using a parameter server, which strategy is most directly aimed at alleviating the communication bottleneck caused by synchronizing gradients?
Optimization considerations in large-scale ML systems
Hard
A.Increasing the number of worker nodes to parallelize the computation more effectively.
B.Using a more complex model with more parameters to increase computational load relative to communication.
C.Gradient quantization, where gradients are converted to lower-precision representations (e.g., 8-bit integers) before being sent over the network.
D.Decreasing the learning rate to ensure that smaller, less frequent updates are sufficient.
Correct Answer: Gradient quantization, where gradients are converted to lower-precision representations (e.g., 8-bit integers) before being sent over the network.
Explanation:
In synchronous distributed training, the time taken to transmit gradients over the network can become a bottleneck. Gradient quantization directly tackles this by reducing the size of the data being communicated. By converting 32-bit floating-point gradients to a lower-precision format (like 8-bit or 16-bit), the total data volume is drastically reduced, thus alleviating the communication bottleneck and speeding up the synchronization step.
Incorrect! Try again.
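A minimal sketch of the idea, assuming a naive per-tensor uniform int8 scheme (production systems typically add stochastic rounding, error feedback, and other refinements):

```python
import numpy as np

# Quantize a float32 gradient tensor to int8 plus one float scale factor.
def quantize_int8(grad):
    scale = np.abs(grad).max() / 127.0         # one float sent alongside
    q = np.round(grad / scale).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

g = np.array([0.50, -0.25, 0.10, -0.01], dtype=np.float32)
q, s = quantize_int8(g)
g_hat = dequantize(q, s)
print(q.nbytes, "bytes instead of", g.nbytes)  # 4 bytes instead of 16
print(float(np.abs(g - g_hat).max()))          # worst-case rounding error ~ s/2
```

The receiver reconstructs an approximate gradient from the int8 payload and the scale, trading a small rounding error for a 4x reduction in network traffic.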
58For a differentiable function f: ℝⁿ → ℝ, what is the geometric relationship between the gradient vector ∇f(x₀) at a point x₀ and the level set {x : f(x) = f(x₀)} that passes through x₀?
Gradients and directional derivatives
Hard
A.The gradient vector is orthogonal (normal) to the tangent plane of the level set at x₀.
B.The gradient vector is parallel to the tangent plane of the level set at x₀.
C.The gradient vector points towards the direction of minimum curvature on the level set at x₀.
D.There is no consistent geometric relationship between the gradient and the level set.
Correct Answer: The gradient vector is orthogonal (normal) to the tangent plane of the level set at x₀.
Explanation:
The gradient vector ∇f points in the direction of steepest ascent. A level set is a surface where the function's value is constant. To move along the level set (in its tangent plane) means moving in a direction where the function value does not change, so the directional derivative in any such tangent direction u must be zero. Since D_u f = ∇f · u = 0, the gradient must be orthogonal to all tangent vectors, making it normal to the level set itself.
Incorrect! Try again.
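A quick check on a function whose level sets are easy to describe; f(x, y) = x² + y² and the sample point are illustrative choices:

```python
# For f(x, y) = x^2 + y^2, level sets are circles centered at the origin.
# At any point, the circle's tangent direction (-y, x) should be
# orthogonal to the gradient (2x, 2y).
def gradient(x, y):
    return (2 * x, 2 * y)

x0, y0 = 3.0, 4.0            # a point on the level set f = 25
gx, gy = gradient(x0, y0)
tx, ty = -y0, x0             # tangent to the circle at (x0, y0)
dot = gx * tx + gy * ty
print(dot)  # 0.0 -- the gradient is normal to the level set
```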
59Consider an optimizer at a saddle point where the true gradient is zero, but there are directions of positive and negative curvature. How would SGD with Momentum behave differently from standard SGD?
Momentum-based optimization
Hard
A.Standard SGD would be stuck, while Momentum's accumulated velocity from previous steps would likely carry it through the saddle point.
B.Both algorithms would get stuck permanently, as the gradient is zero.
C.Both algorithms would escape, but Momentum would oscillate wildly around the saddle point before escaping.
D.Momentum would get stuck due to its tendency to dampen movement, while standard SGD's noise might allow it to escape.
Correct Answer: Standard SGD would be stuck, while Momentum's accumulated velocity from previous steps would likely carry it through the saddle point.
Explanation:
At a saddle point, the gradient is zero. Standard SGD would halt. However, SGD with Momentum maintains a velocity vector v_t = βv_{t−1} − η∇f(θ_t). Even if the current gradient is zero, the term βv_{t−1} from previous steps will be non-zero. This stored velocity allows the optimizer to 'coast' through the flat region of the saddle point and continue descending, making it much better at escaping saddle points.
Incorrect! Try again.
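A toy illustration on the classic saddle f(x, y) = x² − y² (the starting velocity and hyperparameters are made-up values):

```python
# Saddle f(x, y) = x^2 - y^2 with gradient (2x, -2y).
# Starting exactly at the saddle (0, 0): a plain (noiseless) gradient step
# moves nowhere, but a leftover velocity carries momentum through.
eta, beta = 0.1, 0.9

def grad(x, y):
    return 2 * x, -2 * y

# Plain gradient step from the saddle: the gradient is (0, 0), so no movement.
x, y = 0.0, 0.0
gx, gy = grad(x, y)
x, y = x - eta * gx, y - eta * gy
stuck = (x, y) == (0.0, 0.0)

# Momentum with a small accumulated velocity along the descent direction y.
x, y, vx, vy = 0.0, 0.0, 0.0, 0.01
for _ in range(50):
    gx, gy = grad(x, y)
    vx = beta * vx - eta * gx
    vy = beta * vy - eta * gy
    x, y = x + vx, y + vy

print(stuck, y)  # True, and |y| has grown well past the saddle
```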
60For a function that is L-smooth and μ-strongly convex, Batch Gradient Descent (BGD) achieves a linear convergence rate. What is the theoretical convergence rate for Stochastic Gradient Descent (SGD) with a decaying learning rate of η_t = O(1/t)?
Gradient descent and variants (batch, stochastic, mini-batch)
Hard
A.Sublinear convergence, O(1/t).
B.Sublinear convergence, O(1/√t).
C.Logarithmic convergence, O(1/log t).
D.Linear convergence, O(ρᵗ), where ρ ∈ (0, 1).
Correct Answer: Sublinear convergence, O(1/t).
Explanation:
While BGD enjoys a fast linear (or geometric) convergence rate on strongly convex problems, SGD's convergence is fundamentally limited by the variance of its gradient estimates. Even with a carefully chosen diminishing learning rate schedule like η_t = O(1/t), the convergence rate for SGD in terms of the objective function value is sublinear, specifically O(1/t). This is significantly slower than BGD's linear rate, as the inherent noise prevents rapid, direct convergence.