1Which of the following statements best describes the relationship between a population and a sample in statistics?
A.A sample is the entire set of items under study, while a population is a subset.
B.A population is the entire set of items under study, while a sample is a subset selected for analysis.
C.Population and sample are disjoint sets with no overlapping elements.
D.A sample is always infinite, while a population is finite.
Correct Answer: A population is the entire set of items under study, while a sample is a subset selected for analysis.
Explanation:In statistics, the population refers to the total set of observations that can be made, while a sample is a specific subset of data drawn from the population to estimate its characteristics.
2In Hypothesis Testing, what does the p-value represent?
A.The probability that the null hypothesis is true.
B.The probability of observing the data (or something more extreme) assuming the null hypothesis is true.
C.The probability that the alternative hypothesis is true.
D.The probability of making a Type I error.
Correct Answer: The probability of observing the data (or something more extreme) assuming the null hypothesis is true.
Explanation:The p-value quantifies the evidence against the null hypothesis. It is the probability of obtaining test results at least as extreme as the results actually observed, under the assumption that the null hypothesis is correct.
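As an illustration (not part of the original quiz), a p-value for a simple coin-flip hypothesis test can be computed exactly. This sketch assumes a one-sided test of a fair coin and uses only the Python standard library:

```python
from math import comb

def binomial_p_value(n, k, p=0.5):
    """One-sided p-value: probability of observing k or MORE
    successes in n trials, assuming the null hypothesis
    (success probability = p) is true."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

# Observing 60 heads in 100 flips of a supposedly fair coin:
p_val = binomial_p_value(100, 60)
print(p_val)  # roughly 0.028 -- below the usual 0.05 threshold
```

Note that this is the probability of data at least as extreme as observed under the null, not the probability that the null hypothesis is true.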
3Which of the following defines a Type I error?
A.Accepting the null hypothesis when it is false.
B.Rejecting the null hypothesis when it is true.
C.Rejecting the alternative hypothesis when it is true.
D.Failing to reject the null hypothesis due to insufficient data.
Correct Answer: Rejecting the null hypothesis when it is true.
Explanation:A Type I error (False Positive) occurs when the null hypothesis ($H_0$) is true, but we incorrectly reject it.
4What does a 95% Confidence Interval for a mean imply?
A.There is a 95% probability that the true population mean lies within the specific calculated interval.
B.95% of the data points lie within this interval.
C.If we were to take 100 different samples and compute a confidence interval for each, approximately 95 of the intervals would contain the true population mean.
D.The sample mean is 95% accurate.
Correct Answer: If we were to take 100 different samples and compute a confidence interval for each, approximately 95 of the intervals would contain the true population mean.
Explanation:Confidence intervals refer to the long-run frequency. A 95% confidence level means that if the sampling method is repeated many times, 95% of the calculated intervals will encompass the true population parameter.
5What is the range of the Pearson Correlation Coefficient?
A.$0$ to $1$
B.$-1$ to $+1$
C.$-\infty$ to $+\infty$
D.$-0.5$ to $+0.5$
Correct Answer: $-1$ to $+1$
Explanation:The Pearson correlation coefficient ranges from $-1$ (perfect negative correlation) to $+1$ (perfect positive correlation), with $0$ indicating no linear correlation.
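The long-run frequency interpretation of a confidence interval can be checked by simulation. This sketch (illustrative only; the true mean, sample size, and seed are arbitrary choices) counts how often a 95% interval captures the true mean:

```python
import random
import statistics

random.seed(0)
TRUE_MEAN, SIGMA, N = 50.0, 10.0, 100
TRIALS = 1000

covered = 0
for _ in range(TRIALS):
    sample = [random.gauss(TRUE_MEAN, SIGMA) for _ in range(N)]
    m = statistics.mean(sample)
    se = statistics.stdev(sample) / N ** 0.5
    lo, hi = m - 1.96 * se, m + 1.96 * se   # approximate 95% CI for the mean
    if lo <= TRUE_MEAN <= hi:
        covered += 1

print(covered / TRIALS)  # close to 0.95
```

Each individual interval either contains the true mean or it does not; the 95% refers to the fraction of intervals that do over many repetitions.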
6Which principle is the basis of Maximum Likelihood Estimation (MLE)?
A.Minimizing the variance of the estimator.
B.Choosing parameters that maximize the probability of the observed data.
C.Minimizing the sum of squared errors.
D.Maximizing the entropy of the distribution.
Correct Answer: Choosing parameters that maximize the probability of the observed data.
Explanation:MLE seeks to find the parameter values $\theta$ that maximize the likelihood function $L(\theta \mid x)$, which represents the probability of observing the given data $x$ under parameters $\theta$.
7When deriving MLE, why do we often maximize the Log-Likelihood instead of the Likelihood?
A.Log-Likelihood transforms products into sums, simplifying differentiation.
B.Log-Likelihood is always positive.
C.Log-Likelihood changes the location of the maximum.
D.The Likelihood function is not differentiable.
Correct Answer: Log-Likelihood transforms products into sums, simplifying differentiation.
Explanation:Since the likelihood is a product over individual observations (assuming I.I.D), taking the log turns the product into a sum ($\log \prod_i p_i = \sum_i \log p_i$), which is much easier to differentiate for optimization without changing the position of the maximum.
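A quick numerical check (illustrative; the Bernoulli data and the grid search are assumptions of this sketch) confirms that the likelihood and the log-likelihood peak at the same parameter value:

```python
import math

data = [1, 1, 0, 1, 0, 1, 1, 1, 0, 1]   # 7 successes in 10 trials
k, n = sum(data), len(data)

def likelihood(p):
    return p ** k * (1 - p) ** (n - k)

def log_likelihood(p):
    return k * math.log(p) + (n - k) * math.log(1 - p)

# Grid search over p in (0, 1): both objectives peak at the same place.
grid = [i / 1000 for i in range(1, 1000)]
best_L = max(grid, key=likelihood)
best_logL = max(grid, key=log_likelihood)
print(best_L, best_logL)  # both 0.7, i.e. k/n
```

Because log is strictly increasing, the argmax is unchanged; only the shape of the objective becomes additive and easier to differentiate.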
8In the context of Machine Learning, what is a Loss Function?
A.A function that calculates the accuracy of the model.
B.A function that quantifies the difference between the predicted output and the actual target.
C.A function used to increase the learning rate.
D.A function that estimates the likelihood of the data.
Correct Answer: A function that quantifies the difference between the predicted output and the actual target.
Explanation:A Loss Function (or cost function) measures how far the model's predictions are from the actual labels. The goal of training is to minimize this value.
9Which of the following loss functions is most commonly used for Binary Classification problems?
A.Mean Squared Error (MSE)
B.Mean Absolute Error (MAE)
C.Binary Cross-Entropy (Log Loss)
D.Hinge Loss
Correct Answer: Binary Cross-Entropy (Log Loss)
Explanation:Binary Cross-Entropy is the standard loss function for binary classification as it heavily penalizes confident but wrong predictions based on probability.
10Mathematically, a function $f$ is convex if for any two points $x_1$ and $x_2$ and any $\lambda \in [0, 1]$:
A.$f(\lambda x_1 + (1 - \lambda) x_2) \le \lambda f(x_1) + (1 - \lambda) f(x_2)$
B.$f(\lambda x_1 + (1 - \lambda) x_2) \ge \lambda f(x_1) + (1 - \lambda) f(x_2)$
C.$f(\lambda x_1 + (1 - \lambda) x_2) = \lambda f(x_1) + (1 - \lambda) f(x_2)$
D.$f(x_1 + x_2) \le f(x_1) + f(x_2)$
Correct Answer: $f(\lambda x_1 + (1 - \lambda) x_2) \le \lambda f(x_1) + (1 - \lambda) f(x_2)$
Explanation:This is the definition of a convex function. Geometrically, it means the line segment connecting any two points on the graph of the function lies above or on the graph.
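The convexity inequality can be spot-checked numerically. This sketch samples random point pairs and interpolation weights (the ranges, seed, and tolerance are arbitrary choices); finding a single violation proves non-convexity:

```python
import math
import random

def is_convex_on_samples(f, lo, hi, trials=10_000):
    """Spot-check f(l*x1 + (1-l)*x2) <= l*f(x1) + (1-l)*f(x2)
    at random points; a violation proves non-convexity."""
    random.seed(1)
    for _ in range(trials):
        x1, x2 = random.uniform(lo, hi), random.uniform(lo, hi)
        lam = random.random()
        lhs = f(lam * x1 + (1 - lam) * x2)
        rhs = lam * f(x1) + (1 - lam) * f(x2)
        if lhs > rhs + 1e-12:
            return False
    return True

conv_square = is_convex_on_samples(lambda x: x * x, -10, 10)   # True
conv_sine = is_convex_on_samples(math.sin, 0, 2 * math.pi)     # False
print(conv_square, conv_sine)
```

Passing the check does not prove convexity in general, but a found counterexample (as for sine) is conclusive.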
11What is a key property of Convex Optimization problems?
A.They have multiple local minima.
B.Any local minimum is also a global minimum.
C.They are impossible to solve using Gradient Descent.
D.They always have a saddle point.
Correct Answer: Any local minimum is also a global minimum.
Explanation:In convex functions, there are no separate local minima; if a minimum is found, it is guaranteed to be the global minimum, making optimization reliable.
12What happens if the Learning Rate in Gradient Descent is set too high?
A.The model converges very slowly.
B.The model gets stuck in a local minimum.
C.The algorithm may overshoot the minimum and diverge.
D.The loss function becomes convex.
Correct Answer: The algorithm may overshoot the minimum and diverge.
Explanation:A learning rate that is too large causes the update steps to be too big, potentially bouncing back and forth across the valley of the loss function and eventually diverging (moving away from the minimum).
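The overshoot-and-diverge behavior is easy to reproduce on $f(x) = x^2$, whose gradient is $2x$: the update multiplies $x$ by $(1 - 2\eta)$ each step, so any $\eta > 1$ diverges. A minimal sketch:

```python
def gradient_descent(lr, steps=50, x0=5.0):
    """Minimize f(x) = x^2 (gradient 2x) starting from x0."""
    x = x0
    for _ in range(steps):
        x -= lr * 2 * x   # update factor per step: (1 - 2*lr)
    return x

print(abs(gradient_descent(lr=0.1)))   # shrinks toward 0 (converges)
print(abs(gradient_descent(lr=1.1)))   # grows without bound (diverges)
```

With `lr=1.1` the iterate flips sign and grows by 20% each step, bouncing ever farther across the valley.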
13In the context of optimization surfaces, what is a Saddle Point?
A.A point where the gradient is maximum.
B.A point where the gradient is zero, but it is a minimum in one direction and a maximum in another.
C.A point representing the global minimum.
D.A point where the function is undefined.
Correct Answer: A point where the gradient is zero, but it is a minimum in one direction and a maximum in another.
Explanation:A saddle point is a critical point where the gradient is zero (stationary point), but the Hessian matrix has both positive and negative eigenvalues, indicating it curves up in some directions and down in others.
14Which optimization algorithm updates parameters using the gradient calculated from the entire dataset at each step?
A.Stochastic Gradient Descent (SGD)
B.Mini-batch Gradient Descent
C.Batch Gradient Descent
D.Adam
Correct Answer: Batch Gradient Descent
Explanation:Batch Gradient Descent computes the gradient of the cost function with respect to the parameters for the entire training dataset to perform a single update.
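A minimal sketch of Batch Gradient Descent fitting a one-parameter linear model (the toy data, learning rate, and iteration count are arbitrary choices for illustration):

```python
# Fit y = w*x with Batch Gradient Descent.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]          # true slope is 2

w, lr = 0.0, 0.01
for _ in range(500):
    # Gradient of MSE, computed over the ENTIRE dataset:
    grad = 2 / len(xs) * sum((w * x - y) * x for x, y in zip(xs, ys))
    w -= lr * grad                 # exactly one update per full pass
print(round(w, 3))  # approaches 2.0
```

The defining feature is that every update sums over all training examples; SGD would instead use one example per update.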
15What is the primary advantage of Stochastic Gradient Descent (SGD) over Batch Gradient Descent?
A.It is computationally faster per update and can handle large datasets better.
B.It guarantees convergence to the global minimum for non-convex functions.
C.It has no noise in the gradient estimation.
D.It requires more memory.
Correct Answer: It is computationally faster per update and can handle large datasets better.
Explanation:SGD updates weights using only one sample at a time. This makes individual steps very fast and memory-efficient, allowing it to work with datasets that don't fit in RAM, although the path to convergence is noisy.
16What is the formula for the weight update in standard Gradient Descent?
A.$\theta \leftarrow \theta - \eta \nabla_\theta J(\theta)$
B.$\theta \leftarrow \theta + \eta \nabla_\theta J(\theta)$
C.$\theta \leftarrow \eta \nabla_\theta J(\theta)$
D.$\theta \leftarrow \theta - \nabla_\theta J(\theta)$
Correct Answer: $\theta \leftarrow \theta - \eta \nabla_\theta J(\theta)$
Explanation:We subtract the gradient (scaled by the learning rate $\eta$) from the current parameters $\theta$ because the gradient points in the direction of the steepest increase, and we want to minimize the loss.
17How does Momentum help in Gradient Descent?
A.It decreases the learning rate over time.
B.It helps accelerate gradient vectors in the relevant directions, leading to faster convergence.
C.It resets the weights to zero every epoch.
D.It calculates the second-order derivative.
Correct Answer: It helps accelerate gradient vectors in the relevant directions, leading to faster convergence.
Explanation:Momentum accumulates a moving average of past gradients. This helps dampen oscillations (e.g., in ravines) and accelerates updates in directions where the gradient consistently points.
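A minimal sketch of the momentum update on $f(x) = x^2$ (the decay factor $\beta = 0.9$ and the learning rate are conventional but arbitrary choices for this illustration):

```python
def momentum_gd(lr=0.01, beta=0.9, steps=200, x0=5.0):
    """Gradient descent with momentum on f(x) = x^2."""
    x, v = x0, 0.0
    for _ in range(steps):
        grad = 2 * x
        v = beta * v + grad   # accumulate a running average of gradients
        x -= lr * v           # step along the accumulated direction
    return x

print(abs(momentum_gd()))  # near 0
```

The velocity `v` grows when successive gradients agree (accelerating progress) and partially cancels when they oscillate (damping zig-zags in ravines).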
18Which optimizer utilizes the concept of adaptive learning rates by dividing the learning rate by the square root of the sum of accumulated squared gradients?
A.Momentum
B.SGD
C.Adagrad
D.Batch Gradient Descent
Correct Answer: Adagrad
Explanation:Adagrad adapts the learning rate to the parameters, performing larger updates for infrequent parameters and smaller updates for frequent parameters by accumulating the sum of squared gradients in the denominator.
19What issue with Adagrad does RMSProp attempt to fix?
A.The vanishing gradient problem in RNNs.
B.The learning rate decaying to zero too rapidly.
C.The high computational cost of the Hessian.
D.The oscillation of the loss function.
Correct Answer: The learning rate decaying to zero too rapidly.
Explanation:Adagrad accumulates squared gradients from the beginning of training, causing the learning rate to shrink continuously until it becomes infinitesimally small. RMSProp uses a decaying average of squared gradients to keep the learning rate viable.
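The difference between the two accumulators can be seen directly. This sketch feeds a constant squared gradient into both (the decay rate 0.9 and step count are arbitrary choices for illustration):

```python
# Denominator terms that scale the learning rate, given a constant
# squared gradient of 1.0 at every step.
grad_sq = 1.0
adagrad_acc, rms_acc, rho = 0.0, 0.0, 0.9

for _ in range(1000):
    adagrad_acc += grad_sq                         # grows without bound
    rms_acc = rho * rms_acc + (1 - rho) * grad_sq  # leaky average, bounded

print(adagrad_acc)  # 1000.0 -> effective LR shrinks like lr/sqrt(t)
print(rms_acc)      # approaches 1.0 -> effective LR stays usable
```

Adagrad's denominator keeps growing, so its effective learning rate decays toward zero; RMSProp's decaying average saturates, keeping updates alive.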
20The Adam optimizer combines the concepts of which two other optimizers?
A.SGD and Batch Gradient Descent
B.Momentum and RMSProp
C.Adagrad and Adadelta
D.Momentum and SGD
Correct Answer: Momentum and RMSProp
Explanation:Adam (Adaptive Moment Estimation) keeps track of an exponentially decaying average of past gradients (like Momentum) and an exponentially decaying average of past squared gradients (like RMSProp).
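A minimal, illustrative Adam implementation for a scalar parameter (hyperparameters follow common defaults; this is a sketch, not a reference implementation):

```python
def adam_minimize(grad_fn, x0, lr=0.1, b1=0.9, b2=0.999, eps=1e-8, steps=300):
    """Minimal Adam: Momentum-style first moment + RMSProp-style
    second moment, with bias correction for the zero-initialized moments."""
    x, m, v = x0, 0.0, 0.0
    for t in range(1, steps + 1):
        g = grad_fn(x)
        m = b1 * m + (1 - b1) * g        # first moment  (Momentum part)
        v = b2 * v + (1 - b2) * g * g    # second moment (RMSProp part)
        m_hat = m / (1 - b1 ** t)        # bias correction
        v_hat = v / (1 - b2 ** t)
        x -= lr * m_hat / (v_hat ** 0.5 + eps)
    return x

x_min = adam_minimize(lambda x: 2 * x, x0=5.0)  # gradient of f(x) = x^2
print(x_min)  # settles near 0
```

The two update lines correspond directly to the Momentum and RMSProp components named in the answer; the bias correction is covered in question 28.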
21What is a Local Minimum in the context of a loss function?
A.The lowest point in the entire domain of the function.
B.A point where the function value is lower than at all nearby points, but not necessarily the lowest overall.
C.A point where the gradient is infinite.
D.A point where the loss is exactly zero.
Correct Answer: A point where the function value is lower than at all nearby points, but not necessarily the lowest overall.
Explanation:A local minimum is a valley in the landscape that is lower than its immediate surroundings. In non-convex functions, an optimizer might get stuck here instead of finding the global minimum.
22If the gradient of the loss function is Zero, the point could be:
A.Only a Global Minimum.
B.Only a Local Minimum.
C.A Local Maximum, Local Minimum, or Saddle Point.
D.An inflection point with non-zero slope.
Correct Answer: A Local Maximum, Local Minimum, or Saddle Point.
Explanation:A zero gradient indicates a stationary point. Without checking the curvature (second derivative/Hessian), it could be a peak (maximum), a valley (minimum), or a saddle point.
23Which sampling method involves dividing the population into subgroups and then taking a random sample from each subgroup?
A.Simple Random Sampling
B.Stratified Sampling
C.Cluster Sampling
D.Convenience Sampling
Correct Answer: Stratified Sampling
Explanation:Stratified sampling ensures that specific subgroups (strata) of the population are represented adequately by sampling randomly within each stratum.
24The Mean Squared Error (MSE) is calculated as:
A.$\frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$
B.$\frac{1}{n} \sum_{i=1}^{n} |y_i - \hat{y}_i|$
C.$\sum_{i=1}^{n} (y_i - \hat{y}_i)$
D.$\sqrt{\frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2}$
Correct Answer: $\frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$
Explanation:MSE measures the average of the squares of the errors, i.e., the average squared difference between the estimated values ($\hat{y}_i$) and the actual values ($y_i$).
25What is the primary role of the Gradient vector in optimization?
A.It points in the direction of the greatest rate of decrease of the function.
B.It points in the direction of the greatest rate of increase of the function.
C.It indicates the value of the loss function.
D.It determines the curvature of the function.
Correct Answer: It points in the direction of the greatest rate of increase of the function.
Explanation:The gradient vector points in the direction of the steepest ascent. That is why in Gradient Descent, we move in the direction of the negative gradient.
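The claim can be verified numerically on $f(x, y) = x^2 + y^2$: stepping along the gradient increases $f$, and stepping against it decreases $f$. A minimal sketch (the starting point and step size are arbitrary):

```python
def f(x, y):
    return x ** 2 + y ** 2

def grad_f(x, y):
    return (2 * x, 2 * y)   # analytic gradient

x, y = 3.0, 4.0
gx, gy = grad_f(x, y)
step = 0.01

uphill = f(x + step * gx, y + step * gy)    # move WITH the gradient
downhill = f(x - step * gx, y - step * gy)  # move AGAINST it

print(uphill > f(x, y))    # True: the gradient points uphill
print(downhill < f(x, y))  # True: the negative gradient decreases f
```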
26Which of the following functions is Non-Convex?
A.$f(x) = x^2$
B.$f(x) = e^x$
C.$f(x) = \sin(x)$ over $[0, 2\pi]$
D.$f(x) = |x|$
Correct Answer: $f(x) = \sin(x)$ over $[0, 2\pi]$
Explanation:The sine function oscillates between peaks and troughs. Over $[0, \pi]$, the chord connecting the endpoints $(0, 0)$ and $(\pi, 0)$ lies below the arch of the curve, violating the definition of convexity.
27In Mini-batch Gradient Descent, if the batch size is equal to 1, it becomes:
A.Batch Gradient Descent
B.Stochastic Gradient Descent (SGD)
C.Momentum Optimization
D.Adam Optimization
Correct Answer: Stochastic Gradient Descent (SGD)
Explanation:SGD uses a single training example per iteration, which corresponds to a mini-batch size of 1.
28What is Bias Correction in the context of the Adam optimizer?
A.Adding a bias term to the neural network weights.
B.Correcting the moment estimates because they are initialized to zero and biased towards zero at the start of training.
C.Adjusting the dataset to have equal class distribution.
D.Removing outliers from the training data.
Correct Answer: Correcting the moment estimates because they are initialized to zero and biased towards zero at the start of training.
Explanation:Since the moving averages (moments) in Adam are initialized to vectors of 0s, they are biased towards zero, especially during the initial time steps. Bias correction scales these terms to counteract this effect.
29A Learning Rate Schedule (or Decay) is used to:
A.Increase the learning rate as training progresses to jump out of minima.
B.Keep the learning rate constant throughout training.
C.Decrease the learning rate over time to allow fine-grained convergence near the minimum.
D.Randomize the learning rate at every epoch.
Correct Answer: Decrease the learning rate over time to allow fine-grained convergence near the minimum.
Explanation:Large steps are useful early in training, but as the model approaches the minimum, smaller steps are needed to settle into the optimal point without oscillating, hence decaying the learning rate.
30Which statement is true regarding Sampling Error?
A.It is the error caused by observing a sample instead of the whole population.
B.It is the error caused by incorrect data entry.
C.It can be eliminated completely by using Stratified Sampling.
D.It refers to the bias introduced by the researcher.
Correct Answer: It is the error caused by observing a sample instead of the whole population.
Explanation:Sampling error is the natural variation that results from using a subset of the population to estimate parameters, rather than the entire population.
31If two variables have a Correlation Coefficient of 0, it implies:
A.They are completely independent.
B.There is no linear relationship between them.
C.One variable causes the other.
D.There is a strong non-linear relationship.
Correct Answer: There is no linear relationship between them.
Explanation:Pearson correlation only measures linear relationships. Variables could still be dependent in a non-linear way (e.g., $y = x^2$ over a symmetric interval) and have a correlation of 0.
32The assumption that data points are I.I.D stands for:
A.Independent and Identically Distributed
B.Integrated and Inverse Dependent
C.Independent and Inverse Distributed
D.Identical and Implicitly Dependent
Correct Answer: Independent and Identically Distributed
Explanation:I.I.D is a fundamental assumption in ML and statistics, meaning each data point is drawn from the same probability distribution and is independent of the others.
33In the context of MLE, if we assume the errors are Gaussian distributed, minimizing the Mean Squared Error is equivalent to:
A.Maximizing the Likelihood.
B.Minimizing the Likelihood.
C.Maximizing the Variance.
D.Minimizing the Log-Likelihood of a Bernoulli distribution.
Correct Answer: Maximizing the Likelihood.
Explanation:For a Gaussian distribution, the log-likelihood function involves a negative sum of squared errors term. Therefore, maximizing the likelihood is mathematically equivalent to minimizing the sum of squared errors.
34Which of the following is a potential solution to escaping a Local Minimum?
A.Setting the learning rate to 0.
B.Using a strictly convex loss function.
C.Using SGD or adding momentum.
D.Using Batch Gradient Descent with a small learning rate.
Correct Answer: Using SGD or adding momentum.
Explanation:The noise inherent in Stochastic Gradient Descent and the velocity accumulated by Momentum can provide the energy needed to jump out of shallow local minima or traverse saddle points.
35The Hessian Matrix provides information about:
A.The slope of the function.
B.The curvature of the function.
C.The global minimum directly.
D.The learning rate.
Correct Answer: The curvature of the function.
Explanation:The Hessian is a square matrix of second-order partial derivatives. It describes the local curvature of a function of many variables.
36When training a model, Underfitting (high bias) generally implies:
A.The model is too complex and captures noise.
B.The model is too simple to capture the underlying structure of the data.
C.The loss function is non-convex.
D.The learning rate is too high.
Correct Answer: The model is too simple to capture the underlying structure of the data.
Explanation:Underfitting occurs when the model cannot capture the relationship between inputs and outputs, leading to poor performance on both training and test data.
37What is the Central Limit Theorem?
A.The distribution of sample means approximates a normal distribution as the sample size becomes larger, regardless of the population's distribution.
B.The mean of the sample is always equal to the mean of the population.
C.All data distributions eventually become Gaussian over time.
D.The variance of the sample increases as sample size increases.
Correct Answer: The distribution of sample means approximates a normal distribution as the sample size becomes larger, regardless of the population's distribution.
Explanation:CLT states that the sampling distribution of the sample mean approaches a normal distribution as the sample size gets large ($n \ge 30$, usually), even if the population distribution is not normal.
38In Nesterov Accelerated Gradient (NAG), how is the gradient computed differently from standard Momentum?
A.It computes the gradient at the current position.
B.It computes the gradient at the predicted future position (lookahead).
C.It squares the gradient.
D.It ignores the gradient and uses only velocity.
Correct Answer: It computes the gradient at the predicted future position (lookahead).
Explanation:NAG first makes a big jump in the direction of the previous accumulated gradient (momentum step), and then measures the gradient at that 'lookahead' position to make a correction.
39Which loss function is robust to outliers?
A.Mean Squared Error (MSE)
B.Mean Absolute Error (MAE)
C.Exponential Loss
D.Log-Cosh Loss (approximated as MSE)
Correct Answer: Mean Absolute Error (MAE)
Explanation:MSE squares the error, so large errors (outliers) have a disproportionately large impact. MAE takes the absolute difference, treating outliers linearly, making it more robust.
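The robustness difference shows up immediately on a toy error vector with one outlier (the values are arbitrary, chosen for illustration):

```python
def mse(errs):
    return sum(e * e for e in errs) / len(errs)

def mae(errs):
    return sum(abs(e) for e in errs) / len(errs)

clean = [1.0, -1.0, 0.5, -0.5]
with_outlier = clean + [20.0]          # one gross outlier

print(mse(with_outlier) / mse(clean))  # MSE inflates by orders of magnitude
print(mae(with_outlier) / mae(clean))  # MAE grows far more gently
```

Because MSE squares each error, the single outlier contributes $20^2 = 400$ to the sum, dominating the loss; MAE only adds $20$.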
40The Null Hypothesis ($H_0$) usually states:
A.There is a significant effect or relationship.
B.There is no significant effect or relationship (status quo).
C.The sample size is too small.
D.The alternative hypothesis is false.
Correct Answer: There is no significant effect or relationship (status quo).
Explanation:The null hypothesis typically proposes that no statistical significance exists in a set of given observations (e.g., no difference between means, no correlation).
41What effect does a Low Learning Rate generally have on training?
A.Rapid convergence but high risk of overshooting.
B.Divergence of the loss function.
C.Slow convergence but precise estimation of the minimum.
D.It causes the gradient to vanish immediately.
Correct Answer: Slow convergence but precise estimation of the minimum.
Explanation:A small learning rate takes tiny steps. This ensures the algorithm doesn't miss the minimum, but it requires many more epochs to reach convergence.
42In the context of SGD, what is an Epoch?
A.One update of the weights.
B.One complete pass through the entire training dataset.
C.Processing one mini-batch.
D.Reaching the global minimum.
Correct Answer: One complete pass through the entire training dataset.
Explanation:An epoch is defined as one complete cycle through the full training dataset. In SGD, this consists of $N$ weight updates, where $N$ is the number of samples.
43Which of the following is an example of Unbiased Estimator?
A.An estimator whose expected value equals the true population parameter.
B.An estimator that always underestimates the parameter.
C.An estimator with the smallest possible variance.
D.An estimator calculated from a non-random sample.
Correct Answer: An estimator whose expected value equals the true population parameter.
Explanation:Bias in statistics is the difference between the expected value of an estimator and the true value of the parameter. If the difference is zero, it is unbiased.
44Why are Saddle Points problematic for optimization in high dimensions?
A.They are rare in high-dimensional spaces.
B.They are much more common than local minima and gradients near them are very small (plateaus), slowing down learning.
C.They represent the global maximum.
D.They cause the learning rate to explode.
Correct Answer: They are much more common than local minima and gradients near them are very small (plateaus), slowing down learning.
Explanation:In high-dimensional non-convex optimization (like Neural Networks), saddle points are far more frequent than local minima. The flat regions (plateaus) around saddle points cause gradients to be near zero, significantly stalling standard GD.
45The term $\beta_1$ in the Adam optimizer typically controls:
A.The learning rate decay.
B.The exponential decay rate for the first moment estimates (momentum).
C.The exponential decay rate for the second moment estimates (RMSProp part).
D.The epsilon value for numerical stability.
Correct Answer: The exponential decay rate for the first moment estimates (momentum).
Explanation:In Adam, $\beta_1$ (typically 0.9) controls the decay rate of the moving average of the gradient (the first moment).
46For a Linear Regression problem, the loss function surface is:
A.Convex (Bowl-shaped).
B.Non-convex (Many local minima).
C.Flat.
D.Saddle-shaped.
Correct Answer: Convex (Bowl-shaped).
Explanation:Linear regression using Mean Squared Error results in a quadratic loss function, which is strictly convex. It has a single global minimum.
47Which of the following creates a trade-off in Mini-batch size selection?
A.Computational (hardware) efficiency versus gradient noise.
B.Model depth versus model width.
C.Learning rate versus regularization strength.
D.Training accuracy versus the number of layers.
Correct Answer: Computational (hardware) efficiency versus gradient noise.
Explanation:Large batches allow parallel hardware (GPUs) to work efficiently but may converge to sharp minima. Small batches are noisier (good for exploration) but computationally less efficient per epoch.
48Statistical Significance is usually determined by comparing the p-value to:
A.The Correlation Coefficient.
B.The Significance Level ($\alpha$), commonly 0.05.
C.The sample mean.
D.The variance.
Correct Answer: The Significance Level ($\alpha$), commonly 0.05.
Explanation:If the p-value is less than the chosen significance level (e.g., 0.05), the result is deemed statistically significant, and the null hypothesis is rejected.
49What is the primary motivation for using RMSProp over standard Gradient Descent?
A.To simplify the math.
B.To adapt the learning rate for each parameter, dealing with sparse data or varying gradients.
C.To ensure the loss function becomes convex.
D.To remove the need for a derivative.
Correct Answer: To adapt the learning rate for each parameter, dealing with sparse data or varying gradients.
Explanation:RMSProp divides the learning rate by a running average of the magnitudes of recent gradients. This dampens the step size for parameters with high gradients (oscillations) and increases it for those with small gradients.
50Given a population with variance $\sigma^2$, the standard error of the sample mean ($n$ samples) is:
A.$\sigma / \sqrt{n}$
B.$\sigma^2 / n$
C.$\sigma / n$
D.$\sigma \sqrt{n}$
Correct Answer: $\sigma / \sqrt{n}$
Explanation:The Standard Error of the Mean (SEM) quantifies how much the sample mean is expected to fluctuate from the true population mean. It decreases as the square root of the sample size increases.
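The $\sigma / \sqrt{n}$ formula can be verified empirically. This sketch (the values of $\sigma$, $n$, the trial count, and the seed are arbitrary choices) compares the spread of simulated sample means against the theoretical value:

```python
import random
import statistics

random.seed(42)
SIGMA, N, TRIALS = 10.0, 25, 2000

# Draw many samples of size N and measure the spread of their means.
sample_means = [
    statistics.mean(random.gauss(0.0, SIGMA) for _ in range(N))
    for _ in range(TRIALS)
]

empirical_sem = statistics.stdev(sample_means)
theoretical_sem = SIGMA / N ** 0.5   # sigma / sqrt(n) = 10 / 5 = 2
print(empirical_sem, theoretical_sem)  # empirical value is close to 2.0
```

Quadrupling the sample size halves the standard error, which is why precision gains become expensive as $n$ grows.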