Unit 4 - Practice Quiz

INT255 60 Questions

1 What is the primary goal of an optimization algorithm in the context of training a machine learning model?

Optimization problem formulation in ML Easy
A. To make the model as complex as possible
B. To minimize the loss function
C. To maximize the number of features
D. To increase the size of the dataset

2 In machine learning, a 'loss function' or 'cost function' quantifies:

Optimization problem formulation in ML Easy
A. The model's overall accuracy on the test set
B. The computational resources used for training
C. The number of training iterations (epochs)
D. The error or 'cost' of the model's predictions

3 What is a key property of a convex function that makes it desirable for optimization?

Convex sets and convex functions Easy
A. It cannot be minimized
B. Any local minimum is also a global minimum
C. It has multiple local minima
D. It is always a linear function

4 A set of points is defined as a 'convex set' if:

Convex sets and convex functions Easy
A. It has a finite number of points
B. It is shaped like a perfect circle or sphere
C. It must contain the origin (0,0)
D. A line segment connecting any two points within the set lies entirely within the set
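
The definition in question 4 is directly testable with numbers. A quick sketch, assuming the closed unit disk as the candidate set and sampling points along the segment between two of its members:

```python
def in_unit_disk(x, y):
    """Membership test for the closed unit disk, a convex set."""
    return x * x + y * y <= 1.0

a, b = (0.9, 0.0), (-0.5, 0.7)   # two points inside the disk
segment_stays_inside = all(
    in_unit_disk(a[0] + t * (b[0] - a[0]), a[1] + t * (b[1] - a[1]))
    for t in [i / 100 for i in range(101)]
)
print(segment_stays_inside)   # True, as the definition of a convex set requires
```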

5 The gradient of a function f at a specific point x, denoted as ∇f(x), points in the direction of:

Gradients and directional derivatives Easy
A. The steepest descent
B. The function's origin
C. Zero change (a flat area)
D. The steepest ascent

6 In gradient-based optimization, why do we move in the direction of the negative gradient?

Gradients and directional derivatives Easy
A. To move towards a maximum of the function
B. Because the gradient is always a negative value
C. To move towards a minimum of the function
D. To increase the learning rate

7 If the gradient of a loss function at a certain point is zero, what does this indicate?

Gradients and directional derivatives Easy
A. The model has perfectly fit the data
B. The algorithm has encountered an error
C. The point is a critical point (minimum, maximum, or saddle point)
D. The learning rate is too high
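
The gradient facts behind questions 5-7 can be checked numerically. This sketch assumes the toy function f(x, y) = x² + y²: its gradient points in the direction of steepest ascent and vanishes at the critical point (0, 0).

```python
def grad_f(x, y):
    """Analytic gradient of f(x, y) = x**2 + y**2, i.e. (2x, 2y)."""
    return (2 * x, 2 * y)

g = grad_f(1.0, 1.0)    # (2.0, 2.0): points uphill, away from the minimum
g0 = grad_f(0.0, 0.0)   # (0.0, 0.0): gradient is zero at the critical point
print(g, g0)
```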

8 What is the role of the 'learning rate' in the Gradient Descent algorithm?

Gradient descent and variants (batch, stochastic, mini-batch) Easy
A. It determines the step size taken during each iteration
B. It measures the accuracy of the model
C. It defines the size of the mini-batch
D. It specifies the total number of iterations

9 Which variant of Gradient Descent calculates the gradient using the entire training dataset for a single parameter update?

Gradient descent and variants (batch, stochastic, mini-batch) Easy
A. Stochastic Gradient Descent (SGD)
B. Adam
C. Batch Gradient Descent
D. Mini-batch Gradient Descent

10 Stochastic Gradient Descent (SGD) updates the model's parameters using:

Gradient descent and variants (batch, stochastic, mini-batch) Easy
A. Only the validation dataset
B. The entire training dataset
C. A small batch of training examples
D. A single training example at a time

11 What is the primary advantage of Mini-batch Gradient Descent over Batch Gradient Descent?

Gradient descent and variants (batch, stochastic, mini-batch) Easy
A. It does not require setting a learning rate
B. It is computationally more efficient and faster for large datasets
C. It always finds a better minimum
D. It guarantees convergence in fewer iterations
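
The three variants in questions 9-11 differ only in how many examples feed each gradient estimate. A minimal sketch on assumed toy data (1-D least squares with true slope 3.0); the function names here are illustrative, not from any library:

```python
import random

data = [(x, 3.0 * x) for x in range(1, 21)]   # noiseless: y = 3.0 * x

def grad(w, batch):
    """Gradient of mean squared error for the model y ≈ w * x over `batch`."""
    return sum(2 * (w * x - y) * x for x, y in batch) / len(batch)

def train(w, lr, batch_size, steps, seed=0):
    rng = random.Random(seed)
    for _ in range(steps):
        # batch_size=1 is SGD, batch_size=len(data) is batch GD,
        # anything in between is mini-batch GD
        batch = rng.sample(data, batch_size)
        w -= lr * grad(w, batch)
    return w

w_sgd = train(0.0, lr=0.001, batch_size=1, steps=500)
w_mini = train(0.0, lr=0.001, batch_size=5, steps=500)
print(w_sgd, w_mini)   # both approach the true slope 3.0
```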

12 What is the core idea behind the Momentum optimization algorithm?

Momentum-based optimization Easy
A. To accelerate movement in the relevant direction and dampen oscillations
B. To use a different learning rate for every parameter
C. To only use data points that have high error
D. To randomly change direction to escape local minima

13 How does momentum help when the loss surface has long, narrow ravines (valleys)?

Momentum-based optimization Easy
A. It increases the learning rate to flatten the ravine
B. It forces the updates to jump out of the ravine
C. It stops the optimization process
D. It helps accelerate progress along the bottom of the ravine
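
The momentum idea in questions 12-13 fits in two lines: accumulate a decaying velocity, then step along it. A sketch on the assumed toy objective f(w) = w²:

```python
def momentum_step(w, v, lr=0.1, beta=0.9):
    g = 2 * w             # gradient of f(w) = w**2
    v = beta * v + g      # exponentially decaying accumulation of past gradients
    return w - lr * v, v  # step along the velocity, not the raw gradient

w, v = 5.0, 0.0
for _ in range(100):
    w, v = momentum_step(w, v)
print(w)   # close to the minimum at 0
```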

14 The Adam optimizer is an adaptive learning rate method that combines the key ideas of which two other optimizers?

RMSProp and Adam Easy
A. Stochastic Gradient Descent and Batch Gradient Descent
B. L-BFGS and Momentum
C. Momentum and RMSProp
D. Adagrad and Newton's Method

15 What does an 'adaptive learning rate' mean in the context of optimizers like RMSProp and Adam?

RMSProp and Adam Easy
A. The optimizer maintains a separate learning rate for each model parameter
B. The learning rate is randomly chosen at each step
C. The user must manually adapt the learning rate during training
D. The learning rate steadily increases over time

16 What problem seen in other algorithms like Adagrad does RMSProp help to solve?

RMSProp and Adam Easy
A. The updates being too noisy
B. The use of too much memory
C. The learning rate becoming aggressively small and nearly stopping learning
D. The learning rate becoming too large and causing divergence
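
RMSProp's adaptive step can be sketched on the same kind of toy objective (f(w) = w², an assumption for illustration): a leaky average of squared gradients divides the step, so steep directions take smaller effective steps.

```python
import math

def rmsprop_step(w, s, lr=0.01, rho=0.9, eps=1e-8):
    g = 2 * w                        # gradient of f(w) = w**2
    s = rho * s + (1 - rho) * g * g  # decaying average of squared gradients
    return w - lr * g / (math.sqrt(s) + eps), s

w, s = 5.0, 0.0
for _ in range(1000):
    w, s = rmsprop_step(w, s)
print(w)   # oscillates very close to the minimum at 0
```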

17 When training a deep learning model on a massive dataset, which gradient descent variant is most often preferred for practical reasons?

Optimization considerations in large-scale ML systems Easy
A. Mini-batch Gradient Descent
B. Grid Search
C. Batch Gradient Descent
D. Newton's Method

18 A common hardware-related challenge in large-scale machine learning is:

Optimization considerations in large-scale ML systems Easy
A. The computer case not having enough fans
B. Running out of hard drive space to store Python scripts
C. The keyboard wearing out from too much coding
D. Fitting the model and a batch of data into GPU memory (VRAM)

19 Why are convex optimization problems generally easier to solve than non-convex ones?

Convex sets and convex functions Easy
A. They require less data to solve
B. They can only be used for linear models
C. They always converge in a single step
D. They do not have local minima that could trap the optimization algorithm

20 The path of Stochastic Gradient Descent (SGD) towards the minimum is often described as 'noisy' or 'zig-zagging'. Why is this?

Gradient descent and variants (batch, stochastic, mini-batch) Easy
A. Because the gradient is estimated based on only one training sample at a time
B. Because the algorithm adds random noise on purpose
C. Because the learning rate is constantly increasing
D. Because it uses the entire dataset for each step

21 In a logistic regression model, the goal is to find parameters that maximize the likelihood of the training data. How is this typically formulated as a minimization problem for optimization algorithms like gradient descent?

Optimization problem formulation in ML Medium
A. By minimizing the likelihood directly
B. By maximizing the L2 norm of the parameters
C. By minimizing the negative log-likelihood
D. By minimizing the number of misclassified points

22 If a function f is convex and differentiable, and we find a point x* where the gradient ∇f(x*) = 0, what can we conclude about x*?

Convex sets and convex functions Medium
A. x* is a global minimum.
B. x* could be a local maximum.
C. x* is a saddle point.
D. x* is a local minimum, but not necessarily global.

23 What is the primary trade-off when switching from Batch Gradient Descent (BGD) to Stochastic Gradient Descent (SGD) for a large dataset?

Gradient descent and variants (batch, stochastic, mini-batch) Medium
A. Trading faster, more frequent updates for a noisier, less direct convergence path.
B. Trading the need for a learning rate for automatic step size adjustment.
C. Trading a smooth convergence path for a higher computational cost per epoch.
D. Trading lower memory usage for a guaranteed convergence to the global minimum.

24 In the context of Gradient Descent, what is the main role of the momentum term?

Momentum-based optimization Medium
A. It guarantees that the algorithm will find the global minimum in non-convex functions.
B. It helps accelerate convergence in relevant directions and dampens oscillations in others.
C. It adapts the learning rate for each parameter individually based on past gradients.
D. It normalizes the gradient vector to have a unit length.

25 The RMSProp optimization algorithm was primarily designed to solve which specific problem encountered in the Adagrad algorithm?

RMSProp and Adam Medium
A. The tendency to get stuck in saddle points.
B. The aggressive and monotonically decreasing learning rate.
C. The inability to handle sparse data effectively.
D. The high computational cost of calculating second-order derivatives.

26 Which of the following functions is non-convex?

Convex sets and convex functions Medium
A. on the interval
B. (L2 norm)
C.
D.

27 For a function , what is the directional derivative at the point in the direction of the vector ?

Gradients and directional derivatives Medium
A. 18/5
B. 14/5
C. 18
D. 14

28 Which statement best describes the advantage of Mini-batch Gradient Descent over both Stochastic Gradient Descent (SGD) and Batch Gradient Descent (BGD)?

Gradient descent and variants (batch, stochastic, mini-batch) Medium
A. It balances the stability of BGD with the efficiency of SGD by using a small subset of data for each update.
B. It uses the full dataset for each update, but in a vectorized and more efficient manner than BGD.
C. It converges to the global minimum faster than both SGD and BGD on all types of problems.
D. It does not require tuning a learning rate, unlike SGD and BGD.

29 The Adam optimizer can be viewed as a combination of which two other optimization algorithms?

RMSProp and Adam Medium
A. Momentum and RMSProp
B. Newton's method and Batch Gradient Descent
C. L-BFGS and RMSProp
D. Adagrad and Nesterov Momentum

30 In a large-scale distributed training system using data parallelism, what is the typical procedure for a single training step?

Optimization considerations in large-scale ML systems Medium
A. The dataset is split, and each worker trains a completely independent model on its subset, which are later averaged.
B. The model itself is split across multiple workers, and each worker is responsible for computing a part of the forward and backward pass.
C. The model is replicated on multiple workers, each processes a different batch of data, and their resulting gradients are aggregated to update a central model.
D. A central server sends one data point at a time to each worker, waits for the gradient, and updates the model before sending the next point.

31 At a given point on the surface of a loss function J(θ), the negative gradient vector, -∇J(θ), points in which direction?

Gradients and directional derivatives Medium
A. The direction of steepest descent.
B. A direction tangent to the contour line.
C. The direction of steepest ascent.
D. The direction directly towards the global minimum.

32 In the context of a machine learning optimization problem, what is the primary purpose of adding a regularization term (e.g., L1 or L2) to the loss function?

Optimization problem formulation in ML Medium
A. To speed up the convergence of gradient descent.
B. To ensure the loss function is always convex.
C. To penalize model complexity and prevent overfitting.
D. To increase the model's accuracy on the training data.

33 How does Nesterov Accelerated Gradient (NAG) differ fundamentally from standard Momentum?

Momentum-based optimization Medium
A. NAG uses a fixed momentum coefficient of 0.9, while standard momentum requires tuning.
B. NAG is a second-order optimization method, whereas standard momentum is a first-order method.
C. NAG calculates the gradient at a 'lookahead' position, after applying the current velocity, rather than at the current position.
D. NAG uses a decaying average of squared gradients instead of a simple velocity term.

34 You are observing the training loss curve for a model. The loss decreases overall, but it is highly erratic and noisy from one iteration to the next. Which optimization variant is most likely being used?

Gradient descent and variants (batch, stochastic, mini-batch) Medium
A. L-BFGS
B. Newton's Method
C. Batch Gradient Descent (BGD)
D. Stochastic Gradient Descent (SGD)

35 For a convex function f, Jensen's inequality states that for any set of points x₁, ..., xₙ in its domain and any non-negative weights λ₁, ..., λₙ that sum to 1, which of the following is true?

Convex sets and convex functions Medium
A. f(λ₁x₁ + ... + λₙxₙ) ≤ λ₁f(x₁) + ... + λₙf(xₙ)
B. f(λ₁x₁ + ... + λₙxₙ) ≥ λ₁f(x₁) + ... + λₙf(xₙ)
C. f(λ₁x₁ + ... + λₙxₙ) = λ₁f(x₁) + ... + λₙf(xₙ)
D. f(λ₁x₁ + ... + λₙxₙ) < λ₁f(x₁) + ... + λₙf(xₙ)

36 In the Adam optimizer, what is the purpose of the bias correction step for the first and second moment estimates (m_t and v_t)?

RMSProp and Adam Medium
A. To counteract the fact that the moment estimates are initialized at zero and are therefore biased towards zero, especially during initial steps.
B. To ensure the learning rate remains positive throughout training.
C. To normalize the gradients to prevent them from becoming too large (exploding gradients).
D. To add a momentum-like term to the update rule.
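
Question 36's bias correction is easy to see numerically. With m initialized at zero and a constant gradient of 1.0 (an assumed value for this sketch), the raw estimate m lags below the true mean while the corrected m_hat recovers it immediately:

```python
beta1 = 0.9
g = 1.0     # pretend the gradient is constantly 1.0
m = 0.0
for t in range(1, 4):
    m = beta1 * m + (1 - beta1) * g  # biased first-moment estimate
    m_hat = m / (1 - beta1 ** t)     # Adam's bias correction
    print(t, round(m, 4), m_hat)     # m stays below 1.0; m_hat ≈ 1.0 every step
```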

37 If the gradient of a non-convex, differentiable function f is zero at a point x₀, i.e., ∇f(x₀) = 0, what can be said about x₀?

Gradients and directional derivatives Medium
A. x₀ must be a saddle point.
B. x₀ must be a local maximum.
C. x₀ must be a global minimum.
D. x₀ is a stationary point, which could be a local minimum, local maximum, or a saddle point.

38 When would a machine learning engineer most likely choose model parallelism over data parallelism for training a large neural network?

Optimization considerations in large-scale ML systems Medium
A. When the dataset is extremely large but the model is relatively small.
B. When the model is too large to fit into the memory of a single GPU/accelerator.
C. When training on a single machine with multiple cores.
D. When they want to reduce the noise in gradient updates.

39 In momentum-based gradient descent, what is a potential downside of setting the momentum coefficient (e.g., β) very close to 1 (e.g., 0.999)?

Momentum-based optimization Medium
A. The update rule will effectively become equivalent to standard SGD.
B. The optimizer may overshoot the minimum and struggle to stop, especially if the learning rate is not decreased.
C. The optimizer will converge much more slowly as it will heavily dampen all updates.
D. The memory required for training will increase quadratically.

40 You are training a model on a dataset with highly redundant data (e.g., many nearly identical images). Which optimization variant would likely offer the most significant computational speedup over Batch Gradient Descent without a major loss in convergence quality?

Gradient descent and variants (batch, stochastic, mini-batch) Medium
A. Conjugate Gradient.
B. L-BFGS.
C. Newton's Method.
D. Mini-batch or Stochastic Gradient Descent.

41 Consider the Support Vector Machine (SVM) optimization problem with a soft margin. Which of the following statements about its dual formulation is most accurate?

Optimization problem formulation in ML Hard
A. The dual problem involves Lagrange multipliers that are constrained to be negative and sum to one.
B. The dual problem is an unconstrained quadratic programming problem whose dimensionality depends on the number of features.
C. The dual problem's objective function is maximized, and its dimensionality is determined by the number of training samples, making it suitable for high-dimensional feature spaces.
D. The dual problem is computationally equivalent to the primal for all kernel types, offering no specific advantage.

42 Let f: ℝⁿ → ℝ be a function. The epigraph of f, denoted epi(f), is a subset of ℝⁿ⁺¹. What is the most precise relationship between the convexity of the function f and the convexity of its epigraph epi(f)?

Convex sets and convex functions Hard
A. If epi(f) is a convex set, then f must be a convex function, but the converse is not always true.
B. The convexity of f is completely independent of the convexity of epi(f).
C. f is a convex function if and only if epi(f) is a convex set.
D. If f is a convex function, then epi(f) must be a convex set, but the converse is not always true.

43 Consider the function . What is the subgradient at the point ?

Gradients and directional derivatives Hard
A. The set of all vectors such that where .
B. A unique vector .
C. The empty set, as the gradient does not exist.
D. The set of all vectors such that where .

44 For a function f that is L-smooth but not necessarily convex, which statement about the convergence of Batch Gradient Descent (BGD) with a fixed learning rate η is most accurate?

Gradient descent and variants (batch, stochastic, mini-batch) Hard
A. BGD is guaranteed to converge to a global minimum if η ≤ 1/L.
B. BGD is guaranteed to converge to a local minimum for any η > 0.
C. BGD may diverge, oscillate, or converge to a stationary point, but it is not guaranteed to converge to even a local minimum.
D. BGD is guaranteed to converge to a stationary point (where ∇f(x) = 0) if η ≤ 1/L.

45 How does Nesterov Accelerated Gradient (NAG) differ fundamentally from classical momentum in its update rule?

Momentum-based optimization Hard
A. NAG calculates the gradient and the momentum update completely independently and combines them with a weighted average.
B. NAG first takes a step in the direction of the accumulated momentum, computes the gradient at this "lookahead" position, and then makes a correction.
C. NAG uses a larger momentum term (β) than classical momentum, leading to faster acceleration.
D. NAG computes the gradient at the current position, then takes a large step in the direction of the accumulated momentum.
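
The "lookahead" in NAG (question 45) is a one-line change from classical momentum: evaluate the gradient at w - β·v instead of at w. A sketch on the assumed toy objective f(w) = w²:

```python
def nag_step(w, v, lr=0.1, beta=0.9):
    lookahead = w - beta * v  # peek where the momentum is about to carry us
    g = 2 * lookahead         # gradient of f(w) = w**2 at the lookahead point
    v = beta * v + lr * g     # corrected velocity
    return w - v, v

w, v = 5.0, 0.0
for _ in range(100):
    w, v = nag_step(w, v)
print(w)   # close to the minimum at 0
```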

46 In the Adam optimization algorithm, what is the primary purpose of the bias correction step for the first and second moment estimates (m_t and v_t)?

RMSProp and Adam Hard
A. To initialize the moment estimates to non-zero values, avoiding division by zero.
B. To prevent the learning rate from becoming too large during the initial stages of training.
C. To counteract the tendency of the moment estimates to be biased towards zero, especially during the initial timesteps.
D. To normalize the gradients, ensuring they have a unit norm before being used in the update.

47 In a large-scale distributed training system using asynchronous stochastic gradient descent (Async-SGD), what is the primary challenge that can impede convergence compared to its synchronous counterpart?

Optimization considerations in large-scale ML systems Hard
A. "Stale gradients," where a worker updates the central model using a gradient computed based on parameters that are already outdated.
B. Increased memory overhead on the parameter server to store multiple versions of the model.
C. Network latency causing some workers to become completely idle.
D. The requirement for a high-bandwidth connection between the parameter server and workers.

48 Let f and g be two convex functions defined on ℝⁿ. Which of the following functions is NOT guaranteed to be convex?

Convex sets and convex functions Hard
A. αf(x) + βg(x) for non-negative scalars α, β.
B. max(f(x), g(x))
C. min(f(x), g(x))
D. h(f(x)) for a convex function h, assuming h is also non-decreasing.

49 In Stochastic Gradient Descent (SGD), how does the variance of the gradient estimate typically behave as the optimizer approaches a local minimum?

Gradient descent and variants (batch, stochastic, mini-batch) Hard
A. The variance remains constant regardless of the proximity to the minimum.
B. The variance increases exponentially as the optimizer gets closer to the minimum.
C. The variance decreases to zero as the true gradient approaches zero.
D. The variance does not approach zero, which prevents SGD with a fixed learning rate from converging to the exact minimum.

50 Despite its general effectiveness, Adam has been observed to fail to converge on certain simple, convex optimization problems where standard SGD with momentum converges. What is a key theoretical reason for this phenomenon?

RMSProp and Adam Hard
A. The algorithm can be overly sensitive to the choice of the epsilon (ε) parameter.
B. The adaptive learning rate can become excessively small in directions with historically large but currently informative gradients.
C. The second moment estimate can cause the effective learning rate to be dominated by past, irrelevant gradient information, leading to convergence to a suboptimal point.
D. The use of bias correction causes instability in later stages of training.

51 When formulating a linear regression problem, we can add an L1 (Lasso) or L2 (Ridge) regularization term. From an optimization landscape perspective, what is the most significant difference between the objective functions created by these two regularizers?

Optimization problem formulation in ML Hard
A. The L2-regularized objective is smooth and strictly convex, while the L1-regularized objective is non-smooth but still strictly convex.
B. The L1-regularized objective function is non-differentiable at points where any coefficient is zero, leading to sparse solutions, while the L2 objective is smooth everywhere.
C. The L2-regularized objective has a unique global minimum, whereas the L1-regularized objective can have multiple global minima.
D. The L1 regularizer penalizes large coefficients more heavily than the L2 regularizer, preventing any single weight from dominating.

52 Consider the function . What is the directional derivative of at the point in the direction ?

Gradients and directional derivatives Hard
A. 0
B. The directional derivative does not exist.
C. -4
D. -1

53 You are training a neural network and observe that the training loss curve shows large, sustained oscillations that slowly decrease in amplitude. Which hyperparameter adjustment is most likely to mitigate this specific behavior?

Momentum-based optimization Hard
A. Decreasing the learning rate (η) and increasing the momentum coefficient (β) simultaneously.
B. Decreasing the momentum coefficient (β) to reduce the overshoot causing the oscillations.
C. Increasing the batch size to get a more accurate gradient estimate.
D. Increasing the learning rate (η) to escape the oscillatory pattern.

54 Let f be a convex function and X be a random variable. According to Jensen's inequality, what is the relationship between f(E[X]) and E[f(X)]? If f is strictly convex and X is not a constant, what can be further concluded?

Convex sets and convex functions Hard
A. f(E[X]) ≤ E[f(X)], with strict inequality if f is strictly convex.
B. f(E[X]) ≥ E[f(X)], with inequality only if the function is non-linear.
C. f(E[X]) ≤ E[f(X)], with equality holding if and only if f is a linear function.
D. f(E[X]) ≤ E[f(X)], with strict inequality if f is strictly convex and X is not a constant.
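
Jensen's inequality can also be checked empirically. This sketch assumes f(x) = x² (strictly convex) and X uniform on [-1, 1], so E[f(X)] ≈ 1/3 while f(E[X]) ≈ 0:

```python
import random

random.seed(0)
xs = [random.uniform(-1.0, 1.0) for _ in range(10_000)]

def f(x):
    return x * x   # strictly convex

mean_of_f = sum(f(x) for x in xs) / len(xs)   # estimates E[f(X)], near 1/3
f_of_mean = f(sum(xs) / len(xs))              # estimates f(E[X]), near 0
print(mean_of_f, f_of_mean)                   # strict: E[f(X)] > f(E[X])
```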

55 Consider an objective function with a long, narrow, "ravine-like" valley. The gradient is large perpendicular to the ravine and small along it. How would Batch (BGD), Stochastic (SGD), and Mini-batch SGD likely perform?

Gradient descent and variants (batch, stochastic, mini-batch) Hard
A. Mini-batch SGD would be the most inefficient due to the combined problems of zigzagging and gradient noise.
B. All three methods would perform identically, as the underlying gradient direction is the same on average.
C. BGD would zigzag inefficiently across the ravine; SGD's noise could help it move faster along the ravine; Mini-batch would offer a balance.
D. SGD would be the slowest due to its high variance, while BGD would move directly down the ravine.

56 RMSProp was developed to address a major drawback of the AdaGrad algorithm. What is this critical flaw in AdaGrad that RMSProp rectifies?

RMSProp and Adam Hard
A. AdaGrad uses a momentum term that can cause it to overshoot minima.
B. AdaGrad requires manual tuning of a global learning rate, which RMSProp automates completely.
C. AdaGrad's learning rate can sometimes increase, leading to instability, which RMSProp prevents.
D. AdaGrad's learning rate aggressively and monotonically decreases, often becoming infinitesimally small and prematurely stopping learning.
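
AdaGrad's flaw versus RMSProp's fix (question 56) shows up even with a constant gradient (assumed to be 1.0 for this sketch): AdaGrad's accumulator only grows, so its effective step shrinks forever, while RMSProp's leaky average levels off.

```python
g = 1.0                  # constant gradient magnitude, assumed for illustration
ada, rms = 0.0, 0.0
for _ in range(1000):
    ada += g * g                    # AdaGrad: monotonically growing sum
    rms = 0.9 * rms + 0.1 * g * g   # RMSProp: decaying average, levels off at 1.0

lr = 0.1
ada_step = lr / ada ** 0.5   # keeps shrinking: ~0.003 after 1000 steps
rms_step = lr / rms ** 0.5   # stabilizes near lr = 0.1
print(ada_step, rms_step)
```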

57 In a data-parallel, synchronous distributed training setup using a parameter server, which strategy is most directly aimed at alleviating the communication bottleneck caused by synchronizing gradients?

Optimization considerations in large-scale ML systems Hard
A. Increasing the number of worker nodes to parallelize the computation more effectively.
B. Using a more complex model with more parameters to increase computational load relative to communication.
C. Gradient quantization, where gradients are converted to lower-precision representations (e.g., 8-bit integers) before being sent over the network.
D. Decreasing the learning rate to ensure that smaller, less frequent updates are sufficient.

58 For a differentiable function f: ℝⁿ → ℝ, what is the geometric relationship between the gradient vector ∇f(p) at a point p and the level set {x : f(x) = f(p)} that passes through p?

Gradients and directional derivatives Hard
A. The gradient vector is orthogonal (normal) to the tangent plane of the level set at p.
B. The gradient vector is parallel to the tangent plane of the level set at p.
C. The gradient vector points towards the direction of the minimum curvature on the level set.
D. There is no consistent geometric relationship between the gradient and the level set.
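
The orthogonality in question 58 can be verified with numbers. Assuming f(x, y) = x² + y², the level set through p = (3, 4) is the circle of radius 5, whose tangent direction at p is (-y, x):

```python
p = (3.0, 4.0)
grad = (2 * p[0], 2 * p[1])   # gradient of x**2 + y**2 at p: (6, 8), radial
tangent = (-p[1], p[0])       # tangent of the circle x**2 + y**2 = 25 at p

dot = grad[0] * tangent[0] + grad[1] * tangent[1]
print(dot)   # 0.0: the gradient is orthogonal to the level set's tangent
```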

59 Consider an optimizer at a saddle point where the true gradient is zero, but there are directions of positive and negative curvature. How would SGD with Momentum behave differently from standard SGD?

Momentum-based optimization Hard
A. Standard SGD would be stuck, while Momentum's accumulated velocity from previous steps would likely carry it through the saddle point.
B. Both algorithms would get stuck permanently, as the gradient is zero.
C. Both algorithms would escape, but Momentum would oscillate wildly around the saddle point before escaping.
D. Momentum would get stuck due to its tendency to dampen movement, while standard SGD's noise might allow it to escape.

60 For a function that is L-smooth and μ-strongly convex, Batch Gradient Descent (BGD) achieves a linear convergence rate. What is the theoretical convergence rate for Stochastic Gradient Descent (SGD) with a decaying learning rate of η_t = O(1/t)?

Gradient descent and variants (batch, stochastic, mini-batch) Hard
A. Sublinear convergence, O(1/t).
B. Sublinear convergence, O(1/√t).
C. Logarithmic convergence, O(1/log t).
D. Linear convergence, O(ρ^t), where ρ ∈ (0, 1).