Unit 5 - Practice Quiz

CSE273 50 Questions

1 Which of the following statements best describes the relationship between a population and a sample in statistics?

A. A sample is the entire set of items under study, while a population is a subset.
B. A population is the entire set of items under study, while a sample is a subset selected for analysis.
C. Population and sample are disjoint sets with no overlapping elements.
D. A sample is always infinite, while a population is finite.

2 In Hypothesis Testing, what does the p-value represent?

A. The probability that the null hypothesis is true.
B. The probability of observing the data (or something more extreme) assuming the null hypothesis is true.
C. The probability that the alternative hypothesis is true.
D. The probability of making a Type I error.

3 Which of the following defines a Type I error?

A. Accepting the null hypothesis when it is false.
B. Rejecting the null hypothesis when it is true.
C. Rejecting the alternative hypothesis when it is true.
D. Failing to reject the null hypothesis due to insufficient data.

4 What does a 95% Confidence Interval for a mean imply?

A. There is a 95% probability that the true population mean lies within the specific calculated interval.
B. 95% of the data points lie within this interval.
C. If we were to take 100 different samples and compute a confidence interval for each, approximately 95 of the intervals would contain the true population mean.
D. The sample mean is 95% accurate.

5 The Pearson Correlation Coefficient ($r$) ranges between:

A. $0$ and $1$
B. $-1$ and $1$
C. $-\infty$ and $\infty$
D. $-1$ and $0$
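For reference, Pearson's $r$ is the covariance of the two variables normalized by the product of their standard deviations; a minimal sketch (the sample data is illustrative):

```python
import numpy as np

def pearson_r(x, y):
    # Pearson correlation: covariance normalized by product of std devs
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    xc, yc = x - x.mean(), y - y.mean()
    return float((xc @ yc) / np.sqrt((xc @ xc) * (yc @ yc)))

r_pos = pearson_r([1, 2, 3, 4], [2, 4, 6, 8])  # perfect positive linear relation -> 1.0
r_neg = pearson_r([1, 2, 3, 4], [8, 6, 4, 2])  # perfect negative linear relation -> -1.0
```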

6 Which principle is the basis of Maximum Likelihood Estimation (MLE)?

A. Minimizing the variance of the estimator.
B. Choosing parameters that maximize the probability of the observed data.
C. Minimizing the sum of squared errors.
D. Maximizing the entropy of the distribution.

7 When deriving MLE, why do we often maximize the Log-Likelihood instead of the Likelihood?

A. Log-Likelihood transforms products into sums, simplifying differentiation.
B. Log-Likelihood is always positive.
C. Log-Likelihood changes the location of the maximum.
D. The Likelihood function is not differentiable.

8 In the context of Machine Learning, what is a Loss Function?

A. A function that calculates the accuracy of the model.
B. A function that quantifies the difference between the predicted output and the actual target.
C. A function used to increase the learning rate.
D. A function that estimates the likelihood of the data.

9 Which of the following loss functions is most commonly used for Binary Classification problems?

A. Mean Squared Error (MSE)
B. Mean Absolute Error (MAE)
C. Binary Cross-Entropy (Log Loss)
D. Hinge Loss

10 Mathematically, a function $f$ is convex if, for any two points $x_1$ and $x_2$ and any $\lambda \in [0, 1]$:

A. $f(\lambda x_1 + (1-\lambda)x_2) \le \lambda f(x_1) + (1-\lambda) f(x_2)$
B. $f(\lambda x_1 + (1-\lambda)x_2) \ge \lambda f(x_1) + (1-\lambda) f(x_2)$
C. $f(\lambda x_1 + (1-\lambda)x_2) = \lambda f(x_1) + (1-\lambda) f(x_2)$
D. $f(x_1 + x_2) = f(x_1) + f(x_2)$

11 What is a key property of Convex Optimization problems?

A. They have multiple local minima.
B. Any local minimum is also a global minimum.
C. They are impossible to solve using Gradient Descent.
D. They always have a saddle point.

12 What happens if the Learning Rate in Gradient Descent is set too high?

A. The model converges very slowly.
B. The model gets stuck in a local minimum.
C. The algorithm may overshoot the minimum and diverge.
D. The loss function becomes convex.

13 In the context of optimization surfaces, what is a Saddle Point?

A. A point where the gradient is maximum.
B. A point where the gradient is zero, but it is a minimum in one direction and a maximum in another.
C. A point representing the global minimum.
D. A point where the function is undefined.

14 Which optimization algorithm updates parameters using the gradient calculated from the entire dataset at each step?

A. Stochastic Gradient Descent (SGD)
B. Mini-batch Gradient Descent
C. Batch Gradient Descent
D. Adam

15 What is the primary advantage of Stochastic Gradient Descent (SGD) over Batch Gradient Descent?

A. It is computationally faster per update and can handle large datasets better.
B. It guarantees convergence to the global minimum for non-convex functions.
C. It has no noise in the gradient estimation.
D. It requires more memory.

16 What is the formula for the weight update in standard Gradient Descent (with learning rate $\eta$ and loss $L$)?

A. $w := w + \eta \, \nabla L(w)$
B. $w := w - \eta \, \nabla L(w)$
C. $w := w - \nabla L(w)$
D. $w := \eta \, w - \nabla L(w)$
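For study purposes, the standard update $w := w - \eta \, \nabla L(w)$ can be sketched on a toy quadratic loss (the loss, starting point, and learning rate here are illustrative choices):

```python
def grad_descent(grad, w0, eta=0.1, steps=100):
    # Standard gradient descent: step against the gradient, scaled by eta
    w = w0
    for _ in range(steps):
        w = w - eta * grad(w)
    return w

# Toy convex loss L(w) = (w - 3)^2 with gradient 2*(w - 3); minimum at w = 3
w_star = grad_descent(lambda w: 2 * (w - 3), w0=0.0)
```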

17 How does Momentum help in Gradient Descent?

A. It decreases the learning rate over time.
B. It helps accelerate gradient vectors in the right direction, leading to faster convergence.
C. It resets the weights to zero every epoch.
D. It calculates the second-order derivative.
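As a quick illustration of how a velocity term changes the plain gradient step, here is a minimal momentum sketch on the same kind of toy quadratic loss (hyperparameters are illustrative):

```python
def momentum_descent(grad, w0, eta=0.1, beta=0.9, steps=200):
    # Velocity accumulates past gradients, smoothing and accelerating updates
    w, v = w0, 0.0
    for _ in range(steps):
        v = beta * v + grad(w)
        w = w - eta * v
    return w

# Toy convex loss L(w) = (w - 3)^2 with gradient 2*(w - 3); minimum at w = 3
w_star = momentum_descent(lambda w: 2 * (w - 3), w0=0.0)
```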

18 Which optimizer utilizes the concept of adaptive learning rates by dividing the learning rate by the square root of the sum of accumulated squared gradients?

A. Momentum
B. SGD
C. Adagrad
D. Batch Gradient Descent

19 What issue with Adagrad does RMSProp attempt to fix?

A. The vanishing gradient problem in RNNs.
B. The learning rate decaying to zero too rapidly.
C. The high computational cost of the Hessian.
D. The oscillation of the loss function.

20 The Adam optimizer combines the concepts of which two other optimizers?

A. SGD and Batch Gradient Descent
B. Momentum and RMSProp
C. Adagrad and Adadelta
D. Momentum and SGD
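To see how the two ideas combine, here is a minimal sketch of a single Adam step: the first moment `m` plays the role of momentum, the second moment `v` plays the RMSProp role, and both are bias-corrected (the toy loss and hyperparameters are illustrative):

```python
import math

def adam_step(w, g, m, v, t, lr=0.01, b1=0.9, b2=0.999, eps=1e-8):
    # First moment estimate (momentum part)
    m = b1 * m + (1 - b1) * g
    # Second moment estimate (RMSProp part)
    v = b2 * v + (1 - b2) * g * g
    # Bias correction: m and v start at zero and are biased toward zero early on
    m_hat = m / (1 - b1 ** t)
    v_hat = v / (1 - b2 ** t)
    w = w - lr * m_hat / (math.sqrt(v_hat) + eps)
    return w, m, v

# Minimize toy loss L(w) = (w - 3)^2 with gradient 2*(w - 3)
w, m, v = 0.0, 0.0, 0.0
for t in range(1, 5001):
    w, m, v = adam_step(w, 2 * (w - 3), m, v, t)
```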

21 What is a Local Minimum in the context of a loss function?

A. The lowest point in the entire domain of the function.
B. A point where the function value is lower than at all nearby points, but not necessarily the lowest overall.
C. A point where the gradient is infinite.
D. A point where the loss is exactly zero.

22 If the gradient of the loss function is Zero, the point could be:

A. Only a Global Minimum.
B. Only a Local Minimum.
C. A Local Maximum, Local Minimum, or Saddle Point.
D. An inflection point with non-zero slope.

23 Which sampling method involves dividing the population into subgroups and then taking a random sample from each subgroup?

A. Simple Random Sampling
B. Stratified Sampling
C. Cluster Sampling
D. Convenience Sampling

24 The Mean Squared Error (MSE) is calculated as:

A. $\frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2$
B. $\frac{1}{n}\sum_{i=1}^{n}|y_i - \hat{y}_i|$
C. $\sum_{i=1}^{n}(y_i - \hat{y}_i)$
D. $\sqrt{\frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2}$
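For reference, MSE is simply the mean of the squared residuals; a minimal sketch (the sample values are illustrative):

```python
def mse(y_true, y_pred):
    # Mean of squared residuals: (1/n) * sum((y_i - yhat_i)^2)
    n = len(y_true)
    return sum((yt - yp) ** 2 for yt, yp in zip(y_true, y_pred)) / n

err = mse([1.0, 2.0, 3.0], [1.0, 2.5, 2.0])  # (0 + 0.25 + 1) / 3
```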

25 What is the primary role of the Gradient vector in optimization?

A. It points in the direction of the greatest rate of decrease of the function.
B. It points in the direction of the greatest rate of increase of the function.
C. It indicates the value of the loss function.
D. It determines the curvature of the function.

26 Which of the following functions is Non-Convex?

A. $f(x) = x^2$
B. $f(x) = |x|$
C. $f(x) = \sin(x)$ over $[0, 2\pi]$
D. $f(x) = e^x$

27 In Mini-batch Gradient Descent, if the batch size is equal to 1, it becomes:

A. Batch Gradient Descent
B. Stochastic Gradient Descent (SGD)
C. Momentum Optimization
D. Adam Optimization

28 What is Bias Correction in the context of the Adam optimizer?

A. Adding a bias term to the neural network weights.
B. Correcting the moment estimates because they are initialized to zero and biased towards zero at the start of training.
C. Adjusting the dataset to have equal class distribution.
D. Removing outliers from the training data.

29 A Learning Rate Schedule (or Decay) is used to:

A. Increase the learning rate as training progresses to jump out of minima.
B. Keep the learning rate constant throughout training.
C. Decrease the learning rate over time to allow fine-grained convergence near the minimum.
D. Randomize the learning rate at every epoch.

30 Which statement is true regarding Sampling Error?

A. It is the error caused by observing a sample instead of the whole population.
B. It is the error caused by incorrect data entry.
C. It can be eliminated completely by using Stratified Sampling.
D. It refers to the bias introduced by the researcher.

31 If two variables have a Correlation Coefficient of 0, it implies:

A. They are completely independent.
B. There is no linear relationship between them.
C. One variable causes the other.
D. There is a strong non-linear relationship.

32 The abbreviation I.I.D., describing a common assumption about data points, stands for:

A. Independent and Identically Distributed
B. Integrated and Inverse Dependent
C. Independent and Inverse Distributed
D. Identical and Implicitly Dependent

33 In the context of MLE, if we assume the errors are Gaussian distributed, minimizing the Mean Squared Error is equivalent to:

A. Maximizing the Likelihood.
B. Minimizing the Likelihood.
C. Maximizing the Variance.
D. Minimizing the Log-Likelihood of a Bernoulli distribution.

34 Which of the following is a potential solution to escaping a Local Minimum?

A. Setting the learning rate to 0.
B. Using a strictly convex loss function.
C. Using SGD or adding momentum.
D. Using Batch Gradient Descent with a small learning rate.

35 The Hessian Matrix provides information about:

A. The slope of the function.
B. The curvature of the function.
C. The global minimum directly.
D. The learning rate.

36 When training a model, Underfitting (high bias) generally implies:

A. The model is too complex and captures noise.
B. The model is too simple to capture the underlying structure of the data.
C. The loss function is non-convex.
D. The learning rate is too high.

37 What is the Central Limit Theorem?

A. The distribution of sample means approximates a normal distribution as the sample size becomes larger, regardless of the population's distribution.
B. The mean of the sample is always equal to the mean of the population.
C. All data distributions eventually become Gaussian over time.
D. The variance of the sample increases as sample size increases.

38 In Nesterov Accelerated Gradient (NAG), how is the gradient computed differently from standard Momentum?

A. It computes the gradient at the current position.
B. It computes the gradient at the predicted future position (lookahead).
C. It squares the gradient.
D. It ignores the gradient and uses only velocity.

39 Which loss function is robust to outliers?

A. Mean Squared Error (MSE)
B. Mean Absolute Error (MAE)
C. Exponential Loss
D. Log-Cosh Loss (which behaves like MSE for small errors)

40 The Null Hypothesis ($H_0$) usually states:

A. There is a significant effect or relationship.
B. There is no significant effect or relationship (status quo).
C. The sample size is too small.
D. The alternative hypothesis is false.

41 What effect does a Low Learning Rate generally have on training?

A. Rapid convergence but high risk of overshooting.
B. Divergence of the loss function.
C. Slow convergence but precise estimation of the minimum.
D. It causes the gradient to vanish immediately.

42 In the context of SGD, what is an Epoch?

A. One update of the weights.
B. One complete pass through the entire training dataset.
C. Processing one mini-batch.
D. Reaching the global minimum.

43 Which of the following describes an Unbiased Estimator?

A. An estimator whose expected value equals the true population parameter.
B. An estimator that always underestimates the parameter.
C. An estimator with the smallest possible variance.
D. An estimator calculated from a non-random sample.

44 Why are Saddle Points problematic for optimization in high dimensions?

A. They are rare in high-dimensional spaces.
B. They are much more common than local minima and gradients near them are very small (plateaus), slowing down learning.
C. They represent the global maximum.
D. They cause the learning rate to explode.

45 The $\beta_1$ term in the Adam optimizer typically controls:

A. The learning rate decay.
B. The exponential decay rate for the first moment estimates (momentum).
C. The exponential decay rate for the second moment estimates (RMSProp part).
D. The epsilon value for numerical stability.

46 For a Linear Regression problem, the loss function surface is:

A. Convex (Bowl-shaped).
B. Non-convex (Many local minima).
C. Flat.
D. Saddle-shaped.

47 Which of the following creates a trade-off in Mini-batch size selection?

A. Accuracy vs Precision.
B. Computational efficiency (larger batches exploit matrix ops) vs Convergence speed/stability (smaller batches introduce beneficial noise).
C. Training error vs Test error.
D. Bias vs Variance.

48 Statistical Significance is usually determined by comparing the p-value to:

A. The Correlation Coefficient.
B. The Significance Level ($\alpha$), commonly 0.05.
C. The sample mean.
D. The variance.

49 What is the primary motivation for using RMSProp over standard Gradient Descent?

A. To simplify the math.
B. To adapt the learning rate for each parameter, dealing with sparse data or varying gradients.
C. To ensure the loss function becomes convex.
D. To remove the need for a derivative.

50 Given a population with variance $\sigma^2$, the standard error of the sample mean ($n$ samples) is:

A. $\sigma^2 / n$
B. $\sigma / \sqrt{n}$
C. $\sigma / n$
D. $\sigma \sqrt{n}$
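As a quick check of the formula $\sigma / \sqrt{n}$, a minimal sketch (the values of $\sigma$ and $n$ are illustrative):

```python
import math

def standard_error(sigma, n):
    # Standard error of the sample mean: sigma / sqrt(n)
    return sigma / math.sqrt(n)

se = standard_error(sigma=10.0, n=25)  # 10 / sqrt(25) = 2.0
```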