1Which of the following statements best describes the relationship between a population and a sample in statistics?
A.A sample is the entire set of items under study, while a population is a subset.
B.A population is the entire set of items under study, while a sample is a subset selected for analysis.
C.Population and sample are disjoint sets with no overlapping elements.
D.A sample is always infinite, while a population is finite.
Correct Answer: A population is the entire set of items under study, while a sample is a subset selected for analysis.
Explanation:In statistics, the population refers to the total set of observations that can be made, while a sample is a specific subset of data drawn from the population to estimate its characteristics.
2In Hypothesis Testing, what does the p-value represent?
A.The probability that the null hypothesis is true.
B.The probability of observing the data (or something more extreme) assuming the null hypothesis is true.
C.The probability that the alternative hypothesis is true.
D.The probability of making a Type I error.
Correct Answer: The probability of observing the data (or something more extreme) assuming the null hypothesis is true.
Explanation:The p-value quantifies the evidence against the null hypothesis. It is the probability of obtaining test results at least as extreme as the results actually observed, under the assumption that the null hypothesis is correct.
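As an illustration (not part of the original quiz), a p-value for a simple coin-flip hypothesis test can be computed exactly. This sketch assumes a one-sided test of a fair coin and uses only the Python standard library:

```python
from math import comb

def binomial_p_value(n, k, p=0.5):
    """One-sided p-value: probability of observing k or MORE
    successes in n trials, assuming the null hypothesis
    (success probability = p) is true."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

# Observing 60 heads in 100 flips of a supposedly fair coin:
p_val = binomial_p_value(100, 60)
print(p_val)  # roughly 0.028 -- below the usual 0.05 threshold
```

Note that this is the probability of data at least as extreme as observed under the null, not the probability that the null hypothesis is true.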
3Which of the following defines a Type I error?
A.Accepting the null hypothesis when it is false.
B.Rejecting the null hypothesis when it is true.
C.Rejecting the alternative hypothesis when it is true.
D.Failing to reject the null hypothesis due to insufficient data.
Correct Answer: Rejecting the null hypothesis when it is true.
Explanation:A Type I error (False Positive) occurs when the null hypothesis ($H_0$) is true, but we incorrectly reject it.
4What does a 95% Confidence Interval for a mean imply?
A.There is a 95% probability that the true population mean lies within the specific calculated interval.
B.95% of the data points lie within this interval.
C.If we were to take 100 different samples and compute a confidence interval for each, approximately 95 of the intervals would contain the true population mean.
D.The sample mean is 95% accurate.
Correct Answer: If we were to take 100 different samples and compute a confidence interval for each, approximately 95 of the intervals would contain the true population mean.
Explanation:Confidence intervals refer to the long-run frequency. A 95% confidence level means that if the sampling method is repeated many times, 95% of the calculated intervals will encompass the true population parameter.
5What is the range of the Pearson Correlation Coefficient?
A.$0$ to $1$
B.$-1$ to $+1$
C.$-\infty$ to $+\infty$
D.$-0.5$ to $+0.5$
Correct Answer: $-1$ to $+1$
Explanation:The Pearson correlation coefficient ranges from $-1$ (perfect negative correlation) to $+1$ (perfect positive correlation), with $0$ indicating no linear correlation.
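The long-run frequency interpretation of a confidence interval can be checked by simulation. This sketch (illustrative only; the true mean, sample size, and seed are arbitrary choices) counts how often a 95% interval captures the true mean:

```python
import random
import statistics

random.seed(0)
TRUE_MEAN, SIGMA, N = 50.0, 10.0, 100
TRIALS = 1000

covered = 0
for _ in range(TRIALS):
    sample = [random.gauss(TRUE_MEAN, SIGMA) for _ in range(N)]
    m = statistics.mean(sample)
    se = statistics.stdev(sample) / N ** 0.5
    lo, hi = m - 1.96 * se, m + 1.96 * se   # approximate 95% CI for the mean
    if lo <= TRUE_MEAN <= hi:
        covered += 1

print(covered / TRIALS)  # close to 0.95
```

Each individual interval either contains the true mean or it does not; the 95% refers to the fraction of intervals that do over many repetitions.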
6Which principle is the basis of Maximum Likelihood Estimation (MLE)?
A.Minimizing the variance of the estimator.
B.Choosing parameters that maximize the probability of the observed data.
C.Minimizing the sum of squared errors.
D.Maximizing the entropy of the distribution.
Correct Answer: Choosing parameters that maximize the probability of the observed data.
Explanation:MLE seeks to find the parameter values $\theta$ that maximize the likelihood function $L(\theta \mid x)$, which represents the probability of observing the given data $x$ under parameters $\theta$.
7When deriving MLE, why do we often maximize the Log-Likelihood instead of the Likelihood?
A.Log-Likelihood transforms products into sums, simplifying differentiation.
B.Log-Likelihood is always positive.
C.Log-Likelihood changes the location of the maximum.
D.The Likelihood function is not differentiable.
Correct Answer: Log-Likelihood transforms products into sums, simplifying differentiation.
Explanation:Since the likelihood is a product over individual observations (assuming I.I.D), taking the log turns the product into a sum ($\log \prod_i p_i = \sum_i \log p_i$), which is much easier to differentiate for optimization without changing the position of the maximum.
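A quick numerical check (illustrative; the Bernoulli data and the grid search are assumptions of this sketch) confirms that the likelihood and the log-likelihood peak at the same parameter value:

```python
import math

data = [1, 1, 0, 1, 0, 1, 1, 1, 0, 1]   # 7 successes in 10 trials
k, n = sum(data), len(data)

def likelihood(p):
    return p ** k * (1 - p) ** (n - k)

def log_likelihood(p):
    return k * math.log(p) + (n - k) * math.log(1 - p)

# Grid search over p in (0, 1): both objectives peak at the same place.
grid = [i / 1000 for i in range(1, 1000)]
best_L = max(grid, key=likelihood)
best_logL = max(grid, key=log_likelihood)
print(best_L, best_logL)  # both 0.7, i.e. k/n
```

Because log is strictly increasing, the argmax is unchanged; only the shape of the objective becomes additive and easier to differentiate.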
8In the context of Machine Learning, what is a Loss Function?
A.A function that calculates the accuracy of the model.
B.A function that quantifies the difference between the predicted output and the actual target.
C.A function used to increase the learning rate.
D.A function that estimates the likelihood of the data.
Correct Answer: A function that quantifies the difference between the predicted output and the actual target.
Explanation:A Loss Function (or cost function) measures how far the model's predictions are from the actual labels. The goal of training is to minimize this value.
9Which of the following loss functions is most commonly used for Binary Classification problems?
A.Mean Squared Error (MSE)
B.Mean Absolute Error (MAE)
C.Binary Cross-Entropy (Log Loss)
D.Hinge Loss
Correct Answer: Binary Cross-Entropy (Log Loss)
Explanation:Binary Cross-Entropy is the standard loss function for binary classification as it heavily penalizes confident but wrong predictions based on probability.
10Mathematically, a function $f$ is convex if for any two points $x_1$ and $x_2$ and any $\lambda \in [0, 1]$:
A.$f(\lambda x_1 + (1 - \lambda) x_2) \le \lambda f(x_1) + (1 - \lambda) f(x_2)$
B.$f(\lambda x_1 + (1 - \lambda) x_2) \ge \lambda f(x_1) + (1 - \lambda) f(x_2)$
C.$f(\lambda x_1 + (1 - \lambda) x_2) = \lambda f(x_1) + (1 - \lambda) f(x_2)$
D.$f(x_1 + x_2) \le f(x_1) + f(x_2)$
Correct Answer: $f(\lambda x_1 + (1 - \lambda) x_2) \le \lambda f(x_1) + (1 - \lambda) f(x_2)$
Explanation:This is the definition of a convex function. Geometrically, it means the line segment connecting any two points on the graph of the function lies above or on the graph.
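The convexity inequality can be spot-checked numerically. This sketch samples random point pairs and interpolation weights (the ranges, seed, and tolerance are arbitrary choices); finding a single violation proves non-convexity:

```python
import math
import random

def is_convex_on_samples(f, lo, hi, trials=10_000):
    """Spot-check f(l*x1 + (1-l)*x2) <= l*f(x1) + (1-l)*f(x2)
    at random points; a violation proves non-convexity."""
    random.seed(1)
    for _ in range(trials):
        x1, x2 = random.uniform(lo, hi), random.uniform(lo, hi)
        lam = random.random()
        lhs = f(lam * x1 + (1 - lam) * x2)
        rhs = lam * f(x1) + (1 - lam) * f(x2)
        if lhs > rhs + 1e-12:
            return False
    return True

conv_square = is_convex_on_samples(lambda x: x * x, -10, 10)   # True
conv_sine = is_convex_on_samples(math.sin, 0, 2 * math.pi)     # False
print(conv_square, conv_sine)
```

Passing the check does not prove convexity in general, but a found counterexample (as for sine) is conclusive.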
11What is a key property of Convex Optimization problems?
A.They have multiple local minima.
B.Any local minimum is also a global minimum.
C.They are impossible to solve using Gradient Descent.
D.They always have a saddle point.
Correct Answer: Any local minimum is also a global minimum.
Explanation:In convex functions, there are no separate local minima; if a minimum is found, it is guaranteed to be the global minimum, making optimization reliable.
12What happens if the Learning Rate in Gradient Descent is set too high?
A.The model converges very slowly.
B.The model gets stuck in a local minimum.
C.The algorithm may overshoot the minimum and diverge.
D.The loss function becomes convex.
Correct Answer: The algorithm may overshoot the minimum and diverge.
Explanation:A learning rate that is too large causes the update steps to be too big, potentially bouncing back and forth across the valley of the loss function and eventually diverging (moving away from the minimum).
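The overshoot-and-diverge behavior is easy to reproduce on $f(x) = x^2$, whose gradient is $2x$: the update multiplies $x$ by $(1 - 2\eta)$ each step, so any $\eta > 1$ diverges. A minimal sketch:

```python
def gradient_descent(lr, steps=50, x0=5.0):
    """Minimize f(x) = x^2 (gradient 2x) starting from x0."""
    x = x0
    for _ in range(steps):
        x -= lr * 2 * x   # update factor per step: (1 - 2*lr)
    return x

print(abs(gradient_descent(lr=0.1)))   # shrinks toward 0 (converges)
print(abs(gradient_descent(lr=1.1)))   # grows without bound (diverges)
```

With `lr=1.1` the iterate flips sign and grows by 20% each step, bouncing ever farther across the valley.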
13In the context of optimization surfaces, what is a Saddle Point?
A.A point where the gradient is maximum.
B.A point where the gradient is zero, but it is a minimum in one direction and a maximum in another.
C.A point representing the global minimum.
D.A point where the function is undefined.
Correct Answer: A point where the gradient is zero, but it is a minimum in one direction and a maximum in another.
Explanation:A saddle point is a critical point where the gradient is zero (stationary point), but the Hessian matrix has both positive and negative eigenvalues, indicating it curves up in some directions and down in others.
14Which optimization algorithm updates parameters using the gradient calculated from the entire dataset at each step?
A.Stochastic Gradient Descent (SGD)
B.Mini-batch Gradient Descent
C.Batch Gradient Descent
D.Adam
Correct Answer: Batch Gradient Descent
Explanation:Batch Gradient Descent computes the gradient of the cost function with respect to the parameters for the entire training dataset to perform a single update.
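A minimal sketch of Batch Gradient Descent fitting a one-parameter linear model (the toy data, learning rate, and iteration count are arbitrary choices for illustration):

```python
# Fit y = w*x with Batch Gradient Descent.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]          # true slope is 2

w, lr = 0.0, 0.01
for _ in range(500):
    # Gradient of MSE, computed over the ENTIRE dataset:
    grad = 2 / len(xs) * sum((w * x - y) * x for x, y in zip(xs, ys))
    w -= lr * grad                 # exactly one update per full pass
print(round(w, 3))  # approaches 2.0
```

The defining feature is that every update sums over all training examples; SGD would instead use one example per update.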
15What is the primary advantage of Stochastic Gradient Descent (SGD) over Batch Gradient Descent?
A.It is computationally faster per update and can handle large datasets better.
B.It guarantees convergence to the global minimum for non-convex functions.
C.It has no noise in the gradient estimation.
D.It requires more memory.
Correct Answer: It is computationally faster per update and can handle large datasets better.
Explanation:SGD updates weights using only one sample at a time. This makes individual steps very fast and memory-efficient, allowing it to work with datasets that don't fit in RAM, although the path to convergence is noisy.
16What is the formula for the weight update in standard Gradient Descent?
A.$\theta \leftarrow \theta - \eta \nabla_\theta J(\theta)$
B.$\theta \leftarrow \theta + \eta \nabla_\theta J(\theta)$
C.$\theta \leftarrow \eta \nabla_\theta J(\theta)$
D.$\theta \leftarrow \theta - \nabla_\theta J(\theta)$
Correct Answer: $\theta \leftarrow \theta - \eta \nabla_\theta J(\theta)$
Explanation:We subtract the gradient (scaled by the learning rate $\eta$) from the current parameters $\theta$ because the gradient points in the direction of the steepest increase, and we want to minimize the loss.
17How does Momentum help in Gradient Descent?
A.It decreases the learning rate over time.
B.It helps accelerate gradient vectors in the relevant directions, leading to faster convergence.
C.It resets the weights to zero every epoch.
D.It calculates the second-order derivative.
Correct Answer: It helps accelerate gradient vectors in the relevant directions, leading to faster convergence.
Explanation:Momentum accumulates a moving average of past gradients. This helps dampen oscillations (e.g., in ravines) and accelerates updates in directions where the gradient consistently points.
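A minimal sketch of the momentum update on $f(x) = x^2$ (the decay factor $\beta = 0.9$ and the learning rate are conventional but arbitrary choices for this illustration):

```python
def momentum_gd(lr=0.01, beta=0.9, steps=200, x0=5.0):
    """Gradient descent with momentum on f(x) = x^2."""
    x, v = x0, 0.0
    for _ in range(steps):
        grad = 2 * x
        v = beta * v + grad   # accumulate a running average of gradients
        x -= lr * v           # step along the accumulated direction
    return x

print(abs(momentum_gd()))  # near 0
```

The velocity `v` grows when successive gradients agree (accelerating progress) and partially cancels when they oscillate (damping zig-zags in ravines).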
18Which optimizer utilizes the concept of adaptive learning rates by dividing the learning rate by the square root of the sum of accumulated squared gradients?
A.Momentum
B.SGD
C.Adagrad
D.Batch Gradient Descent
Correct Answer: Adagrad
Explanation:Adagrad adapts the learning rate to the parameters, performing larger updates for infrequent parameters and smaller updates for frequent parameters by accumulating the sum of squared gradients in the denominator.
19What issue with Adagrad does RMSProp attempt to fix?
A.The vanishing gradient problem in RNNs.
B.The learning rate decaying to zero too rapidly.
C.The high computational cost of the Hessian.
D.The oscillation of the loss function.
Correct Answer: The learning rate decaying to zero too rapidly.
Explanation:Adagrad accumulates squared gradients from the beginning of training, causing the learning rate to shrink continuously until it becomes infinitesimally small. RMSProp uses a decaying average of squared gradients to keep the learning rate viable.
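The difference between the two accumulators can be seen directly. This sketch feeds a constant squared gradient into both (the decay rate 0.9 and step count are arbitrary choices for illustration):

```python
# Denominator terms that scale the learning rate, given a constant
# squared gradient of 1.0 at every step.
grad_sq = 1.0
adagrad_acc, rms_acc, rho = 0.0, 0.0, 0.9

for _ in range(1000):
    adagrad_acc += grad_sq                         # grows without bound
    rms_acc = rho * rms_acc + (1 - rho) * grad_sq  # leaky average, bounded

print(adagrad_acc)  # 1000.0 -> effective LR shrinks like lr/sqrt(t)
print(rms_acc)      # approaches 1.0 -> effective LR stays usable
```

Adagrad's denominator keeps growing, so its effective learning rate decays toward zero; RMSProp's decaying average saturates, keeping updates alive.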
20The Adam optimizer combines the concepts of which two other optimizers?
A.SGD and Batch Gradient Descent
B.Momentum and RMSProp
C.Adagrad and Adadelta
D.Momentum and SGD
Correct Answer: Momentum and RMSProp
Explanation:Adam (Adaptive Moment Estimation) keeps track of an exponentially decaying average of past gradients (like Momentum) and an exponentially decaying average of past squared gradients (like RMSProp).
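A minimal, illustrative Adam implementation for a scalar parameter (hyperparameters follow common defaults; this is a sketch, not a reference implementation):

```python
def adam_minimize(grad_fn, x0, lr=0.1, b1=0.9, b2=0.999, eps=1e-8, steps=300):
    """Minimal Adam: Momentum-style first moment + RMSProp-style
    second moment, with bias correction for the zero-initialized moments."""
    x, m, v = x0, 0.0, 0.0
    for t in range(1, steps + 1):
        g = grad_fn(x)
        m = b1 * m + (1 - b1) * g        # first moment  (Momentum part)
        v = b2 * v + (1 - b2) * g * g    # second moment (RMSProp part)
        m_hat = m / (1 - b1 ** t)        # bias correction
        v_hat = v / (1 - b2 ** t)
        x -= lr * m_hat / (v_hat ** 0.5 + eps)
    return x

x_min = adam_minimize(lambda x: 2 * x, x0=5.0)  # gradient of f(x) = x^2
print(x_min)  # settles near 0
```

The two update lines correspond directly to the Momentum and RMSProp components named in the answer; the bias correction is covered in question 28.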
21What is a Local Minimum in the context of a loss function?
A.The lowest point in the entire domain of the function.
B.A point where the function value is lower than at all nearby points, but not necessarily the lowest overall.
C.A point where the gradient is infinite.
D.A point where the loss is exactly zero.
Correct Answer: A point where the function value is lower than at all nearby points, but not necessarily the lowest overall.
Explanation:A local minimum is a valley in the landscape that is lower than its immediate surroundings. In non-convex functions, an optimizer might get stuck here instead of finding the global minimum.
22If the gradient of the loss function is Zero, the point could be:
A.Only a Global Minimum.
B.Only a Local Minimum.
C.A Local Maximum, Local Minimum, or Saddle Point.
D.An inflection point with non-zero slope.
Correct Answer: A Local Maximum, Local Minimum, or Saddle Point.
Explanation:A zero gradient indicates a stationary point. Without checking the curvature (second derivative/Hessian), it could be a peak (maximum), a valley (minimum), or a saddle point.
23Which sampling method involves dividing the population into subgroups and then taking a random sample from each subgroup?
A.Simple Random Sampling
B.Stratified Sampling
C.Cluster Sampling
D.Convenience Sampling
Correct Answer: Stratified Sampling
Explanation:Stratified sampling ensures that specific subgroups (strata) of the population are represented adequately by sampling randomly within each stratum.
24The Mean Squared Error (MSE) is calculated as:
A.$\frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$
B.$\frac{1}{n} \sum_{i=1}^{n} |y_i - \hat{y}_i|$
C.$\sum_{i=1}^{n} (y_i - \hat{y}_i)$
D.$\sqrt{\frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2}$
Correct Answer: $\frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$
Explanation:MSE measures the average of the squares of the errors, i.e., the average squared difference between the estimated values ($\hat{y}_i$) and the actual values ($y_i$).
25What is the primary role of the Gradient vector in optimization?
A.It points in the direction of the greatest rate of decrease of the function.
B.It points in the direction of the greatest rate of increase of the function.
C.It indicates the value of the loss function.
D.It determines the curvature of the function.
Correct Answer: It points in the direction of the greatest rate of increase of the function.
Explanation:The gradient vector points in the direction of the steepest ascent. That is why in Gradient Descent, we move in the direction of the negative gradient.
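The claim can be verified numerically on $f(x, y) = x^2 + y^2$: stepping along the gradient increases $f$, and stepping against it decreases $f$. A minimal sketch (the starting point and step size are arbitrary):

```python
def f(x, y):
    return x ** 2 + y ** 2

def grad_f(x, y):
    return (2 * x, 2 * y)   # analytic gradient

x, y = 3.0, 4.0
gx, gy = grad_f(x, y)
step = 0.01

uphill = f(x + step * gx, y + step * gy)    # move WITH the gradient
downhill = f(x - step * gx, y - step * gy)  # move AGAINST it

print(uphill > f(x, y))    # True: the gradient points uphill
print(downhill < f(x, y))  # True: the negative gradient decreases f
```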
26Which of the following functions is Non-Convex?
A.$f(x) = x^2$
B.$f(x) = e^x$
C.$f(x) = \sin(x)$ over $[0, 2\pi]$
D.$f(x) = |x|$
Correct Answer: $f(x) = \sin(x)$ over $[0, 2\pi]$
Explanation:The sine function oscillates between peaks and troughs. Over $[0, \pi]$, the chord connecting the endpoints $(0, 0)$ and $(\pi, 0)$ lies below the arch of the curve, violating the definition of convexity.
27In Mini-batch Gradient Descent, if the batch size is equal to 1, it becomes:
A.Batch Gradient Descent
B.Stochastic Gradient Descent (SGD)
C.Momentum Optimization
D.Adam Optimization
Correct Answer: Stochastic Gradient Descent (SGD)
Explanation:SGD uses a single training example per iteration, which corresponds to a mini-batch size of 1.
28What is Bias Correction in the context of the Adam optimizer?
A.Adding a bias term to the neural network weights.
B.Correcting the moment estimates because they are initialized to zero and biased towards zero at the start of training.
C.Adjusting the dataset to have equal class distribution.
D.Removing outliers from the training data.
Correct Answer: Correcting the moment estimates because they are initialized to zero and biased towards zero at the start of training.
Explanation:Since the moving averages (moments) in Adam are initialized to vectors of 0s, they are biased towards zero, especially during the initial time steps. Bias correction scales these terms to counteract this effect.
29A Learning Rate Schedule (or Decay) is used to:
A.Increase the learning rate as training progresses to jump out of minima.
B.Keep the learning rate constant throughout training.
C.Decrease the learning rate over time to allow fine-grained convergence near the minimum.
D.Randomize the learning rate at every epoch.
Correct Answer: Decrease the learning rate over time to allow fine-grained convergence near the minimum.
Explanation:Large steps are useful early in training, but as the model approaches the minimum, smaller steps are needed to settle into the optimal point without oscillating, hence decaying the learning rate.
30Which statement is true regarding Sampling Error?
A.It is the error caused by observing a sample instead of the whole population.
B.It is the error caused by incorrect data entry.
C.It can be eliminated completely by using Stratified Sampling.
D.It refers to the bias introduced by the researcher.
Correct Answer: It is the error caused by observing a sample instead of the whole population.
Explanation:Sampling error is the natural variation that results from using a subset of the population to estimate parameters, rather than the entire population.
31If two variables have a Correlation Coefficient of 0, it implies:
A.They are completely independent.
B.There is no linear relationship between them.
C.One variable causes the other.
D.There is a strong non-linear relationship.
Correct Answer: There is no linear relationship between them.
Explanation:Pearson correlation only measures linear relationships. Variables could still be dependent in a non-linear way (e.g., $y = x^2$ over a symmetric interval) and have a correlation of 0.
32The assumption that data points are I.I.D stands for:
A.Independent and Identically Distributed
B.Integrated and Inverse Dependent
C.Independent and Inverse Distributed
D.Identical and Implicitly Dependent
Correct Answer: Independent and Identically Distributed
Explanation:I.I.D is a fundamental assumption in ML and statistics, meaning each data point is drawn from the same probability distribution and is independent of the others.
33In the context of MLE, if we assume the errors are Gaussian distributed, minimizing the Mean Squared Error is equivalent to:
A.Maximizing the Likelihood.
B.Minimizing the Likelihood.
C.Maximizing the Variance.
D.Minimizing the Log-Likelihood of a Bernoulli distribution.
Correct Answer: Maximizing the Likelihood.
Explanation:For a Gaussian distribution, the log-likelihood function involves a negative sum of squared errors term. Therefore, maximizing the likelihood is mathematically equivalent to minimizing the sum of squared errors.
34Which of the following is a potential solution to escaping a Local Minimum?
A.Setting the learning rate to 0.
B.Using a strictly convex loss function.
C.Using SGD or adding momentum.
D.Using Batch Gradient Descent with a small learning rate.
Correct Answer: Using SGD or adding momentum.
Explanation:The noise inherent in Stochastic Gradient Descent and the velocity accumulated by Momentum can provide the energy needed to jump out of shallow local minima or traverse saddle points.
35The Hessian Matrix provides information about:
A.The slope of the function.
B.The curvature of the function.
C.The global minimum directly.
D.The learning rate.
Correct Answer: The curvature of the function.
Explanation:The Hessian is a square matrix of second-order partial derivatives. It describes the local curvature of a function of many variables.
36When training a model, Underfitting (high bias) generally implies:
A.The model is too complex and captures noise.
B.The model is too simple to capture the underlying structure of the data.
C.The loss function is non-convex.
D.The learning rate is too high.
Correct Answer: The model is too simple to capture the underlying structure of the data.
Explanation:Underfitting occurs when the model cannot capture the relationship between inputs and outputs, leading to poor performance on both training and test data.
37What is the Central Limit Theorem?
A.The distribution of sample means approximates a normal distribution as the sample size becomes larger, regardless of the population's distribution.
B.The mean of the sample is always equal to the mean of the population.
C.All data distributions eventually become Gaussian over time.
D.The variance of the sample increases as sample size increases.
Correct Answer: The distribution of sample means approximates a normal distribution as the sample size becomes larger, regardless of the population's distribution.
Explanation:CLT states that the sampling distribution of the sample mean approaches a normal distribution as the sample size gets large ($n \ge 30$, usually), even if the population distribution is not normal.
38In Nesterov Accelerated Gradient (NAG), how is the gradient computed differently from standard Momentum?
A.It computes the gradient at the current position.
B.It computes the gradient at the predicted future position (lookahead).
C.It squares the gradient.
D.It ignores the gradient and uses only velocity.
Correct Answer: It computes the gradient at the predicted future position (lookahead).
Explanation:NAG first makes a big jump in the direction of the previous accumulated gradient (momentum step), and then measures the gradient at that 'lookahead' position to make a correction.
39Which loss function is robust to outliers?
A.Mean Squared Error (MSE)
B.Mean Absolute Error (MAE)
C.Exponential Loss
D.Log-Cosh Loss (approximated as MSE)
Correct Answer: Mean Absolute Error (MAE)
Explanation:MSE squares the error, so large errors (outliers) have a disproportionately large impact. MAE takes the absolute difference, treating outliers linearly, making it more robust.
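The robustness difference shows up immediately on a toy error vector with one outlier (the values are arbitrary, chosen for illustration):

```python
def mse(errs):
    return sum(e * e for e in errs) / len(errs)

def mae(errs):
    return sum(abs(e) for e in errs) / len(errs)

clean = [1.0, -1.0, 0.5, -0.5]
with_outlier = clean + [20.0]          # one gross outlier

print(mse(with_outlier) / mse(clean))  # MSE inflates by orders of magnitude
print(mae(with_outlier) / mae(clean))  # MAE grows far more gently
```

Because MSE squares each error, the single outlier contributes $20^2 = 400$ to the sum, dominating the loss; MAE only adds $20$.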
40The Null Hypothesis ($H_0$) usually states:
A.There is a significant effect or relationship.
B.There is no significant effect or relationship (status quo).
C.The sample size is too small.
D.The alternative hypothesis is false.
Correct Answer: There is no significant effect or relationship (status quo).
Explanation:The null hypothesis typically proposes that no statistical significance exists in a set of given observations (e.g., no difference between means, no correlation).
41What effect does a Low Learning Rate generally have on training?
A.Rapid convergence but high risk of overshooting.
B.Divergence of the loss function.
C.Slow convergence but precise estimation of the minimum.
D.It causes the gradient to vanish immediately.
Correct Answer: Slow convergence but precise estimation of the minimum.
Explanation:A small learning rate takes tiny steps. This ensures the algorithm doesn't miss the minimum, but it requires many more epochs to reach convergence.
42In the context of SGD, what is an Epoch?
A.One update of the weights.
B.One complete pass through the entire training dataset.
C.Processing one mini-batch.
D.Reaching the global minimum.
Correct Answer: One complete pass through the entire training dataset.
Explanation:An epoch is defined as one complete cycle through the full training dataset. In SGD, this consists of $N$ weight updates, where $N$ is the number of samples.
43Which of the following is an example of Unbiased Estimator?
A.An estimator whose expected value equals the true population parameter.
B.An estimator that always underestimates the parameter.
C.An estimator with the smallest possible variance.
D.An estimator calculated from a non-random sample.
Correct Answer: An estimator whose expected value equals the true population parameter.
Explanation:Bias in statistics is the difference between the expected value of an estimator and the true value of the parameter. If the difference is zero, it is unbiased.
44Why are Saddle Points problematic for optimization in high dimensions?
A.They are rare in high-dimensional spaces.
B.They are much more common than local minima and gradients near them are very small (plateaus), slowing down learning.
C.They represent the global maximum.
D.They cause the learning rate to explode.
Correct Answer: They are much more common than local minima and gradients near them are very small (plateaus), slowing down learning.
Explanation:In high-dimensional non-convex optimization (like Neural Networks), saddle points are far more frequent than local minima. The flat regions (plateaus) around saddle points cause gradients to be near zero, significantly stalling standard GD.
45The term $\beta_1$ in the Adam optimizer typically controls:
A.The learning rate decay.
B.The exponential decay rate for the first moment estimates (momentum).
C.The exponential decay rate for the second moment estimates (RMSProp part).
D.The epsilon value for numerical stability.
Correct Answer: The exponential decay rate for the first moment estimates (momentum).
Explanation:In Adam, $\beta_1$ (typically 0.9) controls the decay rate of the moving average of the gradient (the first moment).
46For a Linear Regression problem, the loss function surface is:
A.Convex (Bowl-shaped).
B.Non-convex (Many local minima).
C.Flat.
D.Saddle-shaped.
Correct Answer: Convex (Bowl-shaped).
Explanation:Linear regression using Mean Squared Error results in a quadratic loss function, which is strictly convex. It has a single global minimum.
47Which of the following creates a trade-off in Mini-batch size selection?
A.Computational (hardware) efficiency versus gradient noise.
B.Model depth versus model width.
C.Learning rate versus regularization strength.
D.Training accuracy versus the number of layers.
Correct Answer: Computational (hardware) efficiency versus gradient noise.
Explanation:Large batches allow parallel hardware (GPUs) to work efficiently but may converge to sharp minima. Small batches are noisier (good for exploration) but computationally less efficient per epoch.
48Statistical Significance is usually determined by comparing the p-value to:
A.The Correlation Coefficient.
B.The Significance Level ($\alpha$), commonly 0.05.
C.The sample mean.
D.The variance.
Correct Answer: The Significance Level ($\alpha$), commonly 0.05.
Explanation:If the p-value is less than the chosen significance level (e.g., 0.05), the result is deemed statistically significant, and the null hypothesis is rejected.
49What is the primary motivation for using RMSProp over standard Gradient Descent?
A.To simplify the math.
B.To adapt the learning rate for each parameter, dealing with sparse data or varying gradients.
C.To ensure the loss function becomes convex.
D.To remove the need for a derivative.
Correct Answer: To adapt the learning rate for each parameter, dealing with sparse data or varying gradients.
Explanation:RMSProp divides the learning rate by a running average of the magnitudes of recent gradients. This dampens the step size for parameters with high gradients (oscillations) and increases it for those with small gradients.
50Given a population with variance $\sigma^2$, the standard error of the sample mean ($n$ samples) is:
A.$\sigma / \sqrt{n}$
B.$\sigma^2 / n$
C.$\sigma / n$
D.$\sigma \sqrt{n}$
Correct Answer: $\sigma / \sqrt{n}$
Explanation:The Standard Error of the Mean (SEM) quantifies how much the sample mean is expected to fluctuate from the true population mean. It decreases as the square root of the sample size increases.
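The $\sigma / \sqrt{n}$ formula can be verified empirically. This sketch (the values of $\sigma$, $n$, the trial count, and the seed are arbitrary choices) compares the spread of simulated sample means against the theoretical value:

```python
import random
import statistics

random.seed(42)
SIGMA, N, TRIALS = 10.0, 25, 2000

# Draw many samples of size N and measure the spread of their means.
sample_means = [
    statistics.mean(random.gauss(0.0, SIGMA) for _ in range(N))
    for _ in range(TRIALS)
]

empirical_sem = statistics.stdev(sample_means)
theoretical_sem = SIGMA / N ** 0.5   # sigma / sqrt(n) = 10 / 5 = 2
print(empirical_sem, theoretical_sem)  # empirical value is close to 2.0
```

Quadrupling the sample size halves the standard error, which is why precision gains become expensive as $n$ grows.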