Unit5 - Subjective Questions
CSE273 • Practice Questions with Detailed Answers
Define the terms Population and Sample in the context of machine learning statistics. Briefly explain two common sampling techniques.
Population vs. Sample:
- Population: The entire group of data points or individuals that you are interested in studying. In ML, this represents the theoretical distribution of all possible data (e.g., all images of cats in existence).
- Sample: A subset of the population selected for analysis. In ML, the training and test datasets are samples from the real-world population.
Sampling Techniques:
- Simple Random Sampling: Every member of the population has an equal probability of being selected. This reduces selection bias.
- Stratified Sampling: The population is divided into subgroups (strata) based on shared characteristics (e.g., classes in a classification problem), and samples are taken from each stratum to ensure representation.
Explain the concept of Hypothesis Testing. Describe the roles of the Null Hypothesis ($H_0$), Alternative Hypothesis ($H_1$), and the p-value.
Hypothesis Testing is a statistical method used to make inferences or decisions about a population parameter based on sample data.
- Null Hypothesis ($H_0$): The default assumption that there is no effect, no relationship, or no difference between groups. Typically, we try to find evidence to reject this.
- Alternative Hypothesis ($H_1$ or $H_a$): The statement that contradicts the null hypothesis, representing the effect or difference we wish to demonstrate.
- p-value: The probability of observing test results at least as extreme as the results actually observed, assuming that the null hypothesis is true.
- If p-value $\le \alpha$ (significance level, typically 0.05), we reject $H_0$.
- If p-value $> \alpha$, we fail to reject $H_0$.
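As a concrete illustration (a minimal sketch assuming SciPy is available; the sample values are made up), a one-sample t-test follows exactly this reject / fail-to-reject logic:

```python
# Hypothetical data: does this sample's mean differ from 0?
from scipy import stats

sample = [2.1, 2.5, 1.8, 2.9, 2.3, 2.7, 2.0, 2.6]
t_stat, p_value = stats.ttest_1samp(sample, popmean=0.0)

alpha = 0.05  # significance level
if p_value <= alpha:
    decision = "Reject H0"          # evidence against the null
else:
    decision = "Fail to reject H0"
print(decision, round(p_value, 6))
```

Here the sample mean is clearly far from 0, so the p-value is tiny and we reject $H_0$.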
Derive the Maximum Likelihood Estimator (MLE) for the parameter $\theta$ (probability of heads) in a Bernoulli distribution (coin flip) given a sequence of outcomes.
Consider a Bernoulli distribution where $P(X = 1) = \theta$ and $P(X = 0) = 1 - \theta$.
1. Likelihood Function:
For $n$ independent trials with outcomes $x_1, \dots, x_n \in \{0, 1\}$:
$$L(\theta) = \prod_{i=1}^{n} \theta^{x_i} (1-\theta)^{1-x_i}$$
2. Log-Likelihood:
Taking the log makes differentiation easier:
$$\ell(\theta) = \sum_{i=1}^{n} \left[ x_i \log\theta + (1-x_i)\log(1-\theta) \right]$$
Let $k = \sum_{i=1}^{n} x_i$ (total heads). Then $\ell(\theta) = k\log\theta + (n-k)\log(1-\theta)$.
3. Differentiate w.r.t. $\theta$ and set to 0:
$$\frac{d\ell}{d\theta} = \frac{k}{\theta} - \frac{n-k}{1-\theta} = 0$$
4. Solve for $\theta$:
$$k(1-\theta) = (n-k)\theta \implies \hat{\theta} = \frac{k}{n}$$
The MLE is simply the sample mean (proportion of heads).
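The closed-form result can be sanity-checked numerically. This sketch (with illustrative flip data) compares $\hat{\theta} = k/n$ against a brute-force grid search over the log-likelihood:

```python
import math

flips = [1, 1, 0, 1, 0, 1, 1, 0, 1, 1]  # 1 = heads (illustrative data)
n, k = len(flips), sum(flips)

closed_form = k / n  # the derived MLE: proportion of heads

def log_likelihood(theta):
    # ell(theta) = k*log(theta) + (n-k)*log(1-theta)
    return k * math.log(theta) + (n - k) * math.log(1 - theta)

# Brute-force search over a fine grid in (0, 1)
grid = [i / 1000 for i in range(1, 1000)]
best = max(grid, key=log_likelihood)

print(closed_form, best)  # both 0.7
```

The grid maximizer lands exactly on the sample proportion, as the derivation predicts.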
Differentiate between Type I and Type II errors in statistical hypothesis testing.
Type I Error (False Positive):
- Definition: Occurs when the Null Hypothesis ($H_0$) is true, but we incorrectly reject it.
- Probability: Denoted by $\alpha$ (significance level).
- Example: Diagnosing a healthy patient with a disease.
Type II Error (False Negative):
- Definition: Occurs when the Null Hypothesis ($H_0$) is false (the Alternative is true), but we fail to reject $H_0$.
- Probability: Denoted by $\beta$.
- Example: Failing to detect a disease in a sick patient.
| | $H_0$ is True | $H_0$ is False |
|---|---|---|
| Reject $H_0$ | Type I Error ($\alpha$) | Correct Decision |
| Fail to Reject $H_0$ | Correct Decision | Type II Error ($\beta$) |
What is a Confidence Interval? How is it interpreted in the context of estimating a population mean?
Confidence Interval (CI):
A range of values, derived from sample statistics, that is likely to contain the value of an unknown population parameter.
Formula (for large samples):
$$\bar{x} \pm z_{\alpha/2} \frac{\sigma}{\sqrt{n}}$$
Where $\bar{x}$ is the sample mean, $z_{\alpha/2}$ is the Z-score (e.g., 1.96 for 95%), $\sigma$ is the population standard deviation, and $n$ is the sample size.
Interpretation:
A 95% Confidence Interval means that if we were to take 100 different samples and compute a confidence interval for each, approximately 95 of those intervals would contain the true population mean. It does not mean there is a 95% probability the specific interval contains the mean (the parameter is fixed, the interval varies).
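A sketch of the computation with made-up measurements (standard library only; the sample standard deviation stands in for $\sigma$, as is common when $n$ is reasonably large):

```python
import math

data = [12.1, 11.8, 12.5, 12.0, 11.9, 12.3, 12.2, 11.7, 12.4, 12.1]
n = len(data)
mean = sum(data) / n
std = math.sqrt(sum((x - mean) ** 2 for x in data) / (n - 1))  # sample std

z = 1.96  # Z-score for 95% confidence
margin = z * std / math.sqrt(n)
print(f"95% CI: ({mean - margin:.3f}, {mean + margin:.3f})")
```

The interval is centered on the sample mean with half-width $z \cdot \sigma / \sqrt{n}$, so collecting more data (larger $n$) shrinks the interval.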
Compare Mean Squared Error (MSE) and Cross-Entropy Loss. When should each be used?
Mean Squared Error (MSE):
- Formula: $\text{MSE} = \frac{1}{n}\sum_{i=1}^{n} (y_i - \hat{y}_i)^2$
- Usage: Used primarily for Regression problems where the output is a continuous value.
- Properties: Penalizes large errors significantly (due to squaring). Assumes Gaussian noise in MLE context.
Cross-Entropy Loss (Log Loss):
- Formula (Binary): $L = -\frac{1}{n}\sum_{i=1}^{n} \left[ y_i \log\hat{y}_i + (1-y_i)\log(1-\hat{y}_i) \right]$
- Usage: Used primarily for Classification problems (Binary or Multi-class) where outputs are probabilities (0 to 1).
- Properties: Penalizes confident wrong predictions heavily. Convex for logistic regression.
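A minimal hand-written sketch (toy numbers) of both losses, including how cross-entropy punishes a confident wrong prediction:

```python
import math

def mse(y, y_hat):
    # Mean squared error for regression targets
    return sum((a - b) ** 2 for a, b in zip(y, y_hat)) / len(y)

def binary_cross_entropy(y, y_hat):
    # Binary cross-entropy; y_hat are predicted probabilities in (0, 1)
    return -sum(a * math.log(b) + (1 - a) * math.log(1 - b)
                for a, b in zip(y, y_hat)) / len(y)

print(mse([1.0, 2.0, 3.0], [1.1, 1.9, 3.2]))      # regression: small squared error
good = binary_cross_entropy([1, 0], [0.9, 0.1])    # confident and right
bad = binary_cross_entropy([1, 0], [0.01, 0.99])   # confident and wrong
print(good, bad)                                   # bad is far larger
```

The confident wrong prediction incurs a loss dozens of times larger than the correct one, which is exactly the property that makes cross-entropy suitable for probabilistic classifiers.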
Explain the difference between Convex and Non-convex functions. Why is this distinction important in optimization?
Convex Function:
- Definition: A function is convex if a line segment connecting any two points on the graph of the function lies above or on the graph.
- Minima: Has only one global minimum (no local minima). Gradient descent is guaranteed to converge to the global optimum.
- Example: $f(x) = x^2$, or the MSE loss of linear regression.
Non-convex Function:
- Definition: A function that is not convex; it has "hills and valleys."
- Minima: Can have multiple local minima, saddle points, and one global minimum.
- Example: Neural network loss landscapes.
Importance:
Optimization is much harder for non-convex functions because algorithms like Gradient Descent can get stuck in a local minimum or a saddle point rather than finding the best possible solution (global minimum).
Define Local Minima, Global Minima, and Saddle Points.
- Global Minimum: The point in the entire domain of the function where the function value is the lowest.
- Local Minimum: A point where the function value is lower than all valid surrounding points in a specific neighborhood, but not necessarily the lowest in the entire domain.
- Saddle Point: A point where the gradient is zero (stationary point), but it is not an extremum (neither a pure minimum nor maximum). In some directions, the function curves up (like a minimum), and in others, it curves down (like a maximum). Saddle points effectively slow down training in high-dimensional non-convex optimization.
Discuss the impact of the Learning Rate ($\eta$) on the convergence of Gradient Descent.
The learning rate $\eta$ determines the step size at each iteration while moving toward a minimum.
- Too Small:
- Pros: Precise; likely to reach the minimum eventually.
- Cons: Training is extremely slow. It may get stuck in local minima more easily.
- Too Large:
- Pros: Faster initial movement.
- Cons: Can overshoot the minimum. It may oscillate or even diverge (loss increases to infinity).
- Optimal:
- Ideally, the learning rate should decay over time (Learning Rate Scheduling) to take large steps initially and small, precise steps near the convergence point.
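These regimes are easy to see on the toy objective $J(\theta) = \theta^2$ (gradient $2\theta$); a sketch with three hand-picked rates:

```python
def run_gd(lr, steps=20, theta=5.0):
    # Gradient descent on J(theta) = theta^2, whose gradient is 2*theta
    for _ in range(steps):
        theta = theta - lr * 2 * theta
    return theta

print(run_gd(0.01))  # too small: barely moved toward the minimum at 0
print(run_gd(0.3))   # reasonable: essentially converged to 0
print(run_gd(1.1))   # too large: overshoots and diverges
```

Each update multiplies $\theta$ by $(1 - 2\eta)$, so convergence requires $|1 - 2\eta| < 1$; the third rate violates this and the iterates blow up.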
Derive the Gradient Descent update rule mathematically. What is the role of the gradient?
Goal: Minimize a cost function $J(\theta)$ parameterized by $\theta$.
Taylor Series Approximation:
Around a point $\theta$, the function can be approximated to first order as:
$$J(\theta + \Delta\theta) \approx J(\theta) + \nabla J(\theta)^T \Delta\theta$$
To decrease $J$, we want $J(\theta + \Delta\theta) < J(\theta)$, implying $\nabla J(\theta)^T \Delta\theta < 0$.
To maximize the decrease, we choose $\Delta\theta$ in the direction opposite to the gradient:
$$\Delta\theta = -\eta \nabla J(\theta)$$
Where $\eta$ is the learning rate (step size).
Update Rule:
$$\theta_{t+1} = \theta_t - \eta \nabla J(\theta_t)$$
Role of the Gradient:
The gradient vector $\nabla J(\theta)$ points in the direction of steepest ascent. Therefore, subtracting the gradient moves the parameters in the direction of steepest descent, minimizing the loss.
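The update rule in action, as a sketch on the one-parameter objective $J(\theta) = (\theta - 3)^2$ with analytic gradient $2(\theta - 3)$:

```python
def grad_J(theta):
    return 2 * (theta - 3)  # gradient of J(theta) = (theta - 3)^2

theta, eta = 0.0, 0.1
for _ in range(100):
    theta = theta - eta * grad_J(theta)  # theta_{t+1} = theta_t - eta * grad

print(theta)  # converges to the minimizer, theta = 3
```

Because the gradient is positive to the right of 3 and negative to the left, subtracting it always pushes $\theta$ toward the minimum.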
Compare Batch Gradient Descent, Stochastic Gradient Descent (SGD), and Mini-batch Gradient Descent.
1. Batch Gradient Descent:
- Mechanism: Uses the entire dataset to compute the gradient for one update step.
- Pros: Stable convergence; smooth error curve.
- Cons: Computationally very expensive for large datasets; requires large memory.
2. Stochastic Gradient Descent (SGD):
- Mechanism: Uses a single random training example $(x^{(i)}, y^{(i)})$ to compute the gradient and update parameters.
- Pros: Frequent updates; faster per iteration; can escape local minima due to noise.
- Cons: High variance in updates; objective function fluctuates heavily (noisy convergence).
3. Mini-batch Gradient Descent:
- Mechanism: Uses a small batch of samples (e.g., 32, 64) for each update.
- Pros: Best of both worlds—vectorization efficiency (like Batch) and faster convergence (like SGD). Standard in Deep Learning.
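One function can demonstrate all three variants, since the batch size is the only difference. A sketch on 1-D linear regression $y = 2x$ (illustrative data; `batch_size=len(xs)` gives Batch GD and `batch_size=1` gives SGD):

```python
import random

random.seed(0)
xs = [i / 10 for i in range(1, 101)]
ys = [2.0 * x for x in xs]  # true weight is 2

def train(batch_size, epochs=50, lr=0.01):
    w = 0.0
    data = list(zip(xs, ys))
    for _ in range(epochs):
        random.shuffle(data)
        for start in range(0, len(data), batch_size):
            batch = data[start:start + batch_size]
            # gradient of MSE w.r.t. w, averaged over the batch
            grad = sum(2 * (w * x - y) * x for x, y in batch) / len(batch)
            w -= lr * grad
    return w

print(train(batch_size=len(xs)))  # Batch GD: one smooth update per epoch
print(train(batch_size=1))        # SGD: noisy per-sample updates
print(train(batch_size=10))       # Mini-batch: the usual compromise
```

All three recover $w \approx 2$; what differs is the number of updates per epoch and the noise in each step.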
What is the Correlation Coefficient? How does it differ from Covariance?
Correlation Coefficient (Pearson's $r$):
A normalized statistical measure that quantifies the strength and direction of the linear relationship between two variables:
$$r = \frac{\text{Cov}(X, Y)}{\sigma_X \sigma_Y}$$
- Range: $[-1, 1]$.
- $r = +1$: Perfect positive linear relationship.
- $r = -1$: Perfect negative linear relationship.
- $r = 0$: No linear relationship.
Difference from Covariance:
- Covariance indicates the direction of the linear relationship (positive or negative) but is scale-dependent. If you multiply a variable by 100, the covariance also scales by 100, making its magnitude difficult to interpret.
- Correlation is the standardized version of covariance. It is unitless and scale-invariant, making it easier to compare relationships across different datasets.
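A standard-library sketch of the scale dependence (toy data): rescaling one variable changes the covariance but leaves the correlation untouched.

```python
import statistics

x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

def covariance(a, b):
    # Sample covariance (divides by n - 1)
    ma, mb = statistics.mean(a), statistics.mean(b)
    return sum((ai - ma) * (bi - mb) for ai, bi in zip(a, b)) / (len(a) - 1)

def correlation(a, b):
    # Pearson's r: covariance normalized by both standard deviations
    return covariance(a, b) / (statistics.stdev(a) * statistics.stdev(b))

x_scaled = [xi * 100 for xi in x]
print(covariance(x, y), covariance(x_scaled, y))    # covariance scales by 100
print(correlation(x, y), correlation(x_scaled, y))  # correlation is unchanged
```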
Explain the concept of Momentum in optimization. How does it help SGD?
Concept:
Momentum is a method that helps accelerate SGD in the relevant direction and dampens oscillations. It accumulates an exponentially decaying moving average of past gradients.
Update Rule:
Let $v_t$ be the velocity (accumulated gradient) at time $t$:
$$v_t = \gamma v_{t-1} + \eta \nabla J(\theta)$$
$$\theta = \theta - v_t$$
Where $\gamma$ (usually 0.9) is the momentum term.
How it helps:
- Ravines: In areas where the surface curves much more steeply in one dimension than in another (ravines), SGD oscillates. Momentum adds the history of past updates, cancelling out the oscillations and boosting the velocity in the direction of the minimum.
- It helps the optimizer "roll past" small local minima.
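A sketch of the velocity update on $J(\theta) = \theta^2$, with $\gamma = 0.9$ as in the text (the learning rate 0.01 is an assumed value for illustration):

```python
def momentum_gd(steps=200, eta=0.01, gamma=0.9, theta=5.0):
    v = 0.0  # velocity: decaying accumulation of past gradients
    for _ in range(steps):
        grad = 2 * theta            # gradient of theta^2
        v = gamma * v + eta * grad  # v_t = gamma * v_{t-1} + eta * grad
        theta = theta - v           # theta = theta - v_t
    return theta

print(momentum_gd())  # close to the minimum at 0
```

The iterates overshoot and oscillate briefly (the "heavy ball" effect) but the accumulated velocity carries them to the minimum faster than plain GD at the same rate.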
Describe the RMSProp optimizer. How does it address the issue of diminishing learning rates in Adagrad?
RMSProp (Root Mean Square Propagation):
RMSProp is an adaptive learning rate method designed to resolve Adagrad's radically diminishing learning rates.
Mechanism:
Instead of accumulating the sum of squared gradients from the beginning (like Adagrad), RMSProp uses an exponentially decaying average of squared gradients.
Equations:
- Compute gradient: $g_t = \nabla J(\theta_t)$
- Update moving average of squared gradients: $E[g^2]_t = \beta E[g^2]_{t-1} + (1-\beta) g_t^2$
- Update parameters: $\theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{E[g^2]_t} + \epsilon} g_t$
Benefit:
This limits the influence of early gradients. The denominator does not grow monotonically, allowing the learning rate to adapt to the recent magnitude of gradients (large gradients → smaller step, small gradients → larger step) without vanishing entirely.
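A sketch of the mechanism on $J(\theta) = \theta^2$; the hyperparameters ($\beta = 0.9$, $\eta = 0.01$, $\epsilon = 10^{-8}$) are assumed typical values, not taken from the text:

```python
import math

def rmsprop(steps=2000, eta=0.01, beta=0.9, eps=1e-8, theta=5.0):
    avg_sq = 0.0  # exponentially decaying average of squared gradients
    for _ in range(steps):
        g = 2 * theta                                # gradient of theta^2
        avg_sq = beta * avg_sq + (1 - beta) * g * g  # E[g^2]_t
        theta = theta - eta * g / (math.sqrt(avg_sq) + eps)
    return theta

print(rmsprop())  # settles in a small neighborhood of the minimum at 0
```

Note the normalized step size: dividing by $\sqrt{E[g^2]_t}$ makes each step roughly $\eta$ in magnitude regardless of the raw gradient scale, which is why the step never dies out the way Adagrad's can.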
Explain the Adam (Adaptive Moment Estimation) optimizer. Why is it considered one of the best choices for training neural networks?
Adam combines the advantages of Momentum (handling oscillations) and RMSProp (adaptive learning rates).
Mechanism:
It computes adaptive learning rates for each parameter by maintaining estimates of the first moment (mean) and the second moment (uncentered variance) of the gradients.
Steps:
- First Moment (Momentum): $m_t = \beta_1 m_{t-1} + (1-\beta_1) g_t$
- Second Moment (RMSProp): $v_t = \beta_2 v_{t-1} + (1-\beta_2) g_t^2$
- Bias Correction: Since $m_t$ and $v_t$ are initialized to 0, they are biased toward 0 initially, so we use $\hat{m}_t = \frac{m_t}{1-\beta_1^t}$ and $\hat{v}_t = \frac{v_t}{1-\beta_2^t}$.
- Update Rule: $\theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{\hat{v}_t} + \epsilon} \hat{m}_t$
Why it is effective:
- It adapts learning rates for every parameter.
- It handles sparse gradients well.
- It includes momentum to speed up convergence.
- Bias correction ensures stability at the start of training.
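A self-contained sketch of the four steps on $J(\theta) = \theta^2$; the defaults ($\beta_1 = 0.9$, $\beta_2 = 0.999$, $\epsilon = 10^{-8}$, $\eta = 0.01$) are the commonly used values, assumed here for illustration:

```python
import math

def adam(steps=5000, eta=0.01, b1=0.9, b2=0.999, eps=1e-8, theta=5.0):
    m = v = 0.0
    for t in range(1, steps + 1):
        g = 2 * theta                      # gradient of theta^2
        m = b1 * m + (1 - b1) * g          # first moment (mean)
        v = b2 * v + (1 - b2) * g * g      # second moment (uncentered variance)
        m_hat = m / (1 - b1 ** t)          # bias correction
        v_hat = v / (1 - b2 ** t)
        theta = theta - eta * m_hat / (math.sqrt(v_hat) + eps)
    return theta

print(adam())  # settles near the minimum at 0
```

Each parameter effectively gets its own step size ($\eta / \sqrt{\hat{v}_t}$), while $\hat{m}_t$ supplies the momentum-smoothed direction.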
What is the Maximum Likelihood Principle? Explain with an example why maximizing the likelihood is equivalent to minimizing the Negative Log-Likelihood (NLL).
Maximum Likelihood Principle:
A method for estimating the parameters of a statistical model. It selects the parameter values that maximize the probability (likelihood) of observing the given sample data.
Minimizing Negative Log-Likelihood:
- The likelihood is a product of probabilities (numbers in $[0, 1]$), leading to numerical underflow.
- Taking the logarithm turns products into sums: $\log \prod_i p_i = \sum_i \log p_i$, which is numerically stable.
- The log function is monotonically increasing, so maximizing $L(\theta)$ is equivalent to maximizing $\log L(\theta)$.
- Optimization algorithms (like Gradient Descent) are designed to minimize functions.
- Therefore, maximizing $\log L(\theta)$ is mathematically equivalent to minimizing $-\log L(\theta)$ (Negative Log-Likelihood).
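A sketch with toy Bernoulli data and a grid search, showing that the argmax of $L(\theta)$ and the argmin of the NLL coincide:

```python
import math

flips = [1, 1, 0, 1, 1, 0, 1, 1]  # illustrative outcomes: 6 heads of 8

def likelihood(theta):
    # L(theta): product of per-observation probabilities
    return math.prod(theta if x == 1 else 1 - theta for x in flips)

def nll(theta):
    # Negative log-likelihood: sum of -log probabilities
    return -sum(math.log(theta if x == 1 else 1 - theta) for x in flips)

grid = [i / 1000 for i in range(1, 1000)]
ml_theta = max(grid, key=likelihood)
nll_theta = min(grid, key=nll)
print(ml_theta, nll_theta)  # identical: 0.75
```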
Derive the Maximum Likelihood Estimators for the mean ($\mu$) and variance ($\sigma^2$) of a Gaussian (Normal) Distribution.
Likelihood Function:
For $n$ i.i.d. samples $x_1, \dots, x_n$ from $\mathcal{N}(\mu, \sigma^2)$:
$$L(\mu, \sigma^2) = \prod_{i=1}^{n} \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(x_i - \mu)^2}{2\sigma^2}\right)$$
Log-Likelihood:
$$\ell(\mu, \sigma^2) = -\frac{n}{2}\log(2\pi) - \frac{n}{2}\log\sigma^2 - \frac{1}{2\sigma^2}\sum_{i=1}^{n}(x_i - \mu)^2$$
1. MLE for $\mu$:
Differentiate w.r.t. $\mu$ and set to 0:
$$\frac{\partial \ell}{\partial \mu} = \frac{1}{\sigma^2}\sum_{i=1}^{n}(x_i - \mu) = 0 \implies \hat{\mu} = \frac{1}{n}\sum_{i=1}^{n} x_i$$
2. MLE for $\sigma^2$:
Let $v = \sigma^2$. Differentiate w.r.t. $v$:
$$\frac{\partial \ell}{\partial v} = -\frac{n}{2v} + \frac{1}{2v^2}\sum_{i=1}^{n}(x_i - \hat{\mu})^2 = 0$$
Multiply by $2v^2$:
$$-nv + \sum_{i=1}^{n}(x_i - \hat{\mu})^2 = 0 \implies \hat{\sigma}^2 = \frac{1}{n}\sum_{i=1}^{n}(x_i - \hat{\mu})^2$$
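The closed forms can be computed directly (illustrative sample; note the MLE variance divides by $n$, not $n-1$, so it is the biased estimator):

```python
data = [4.2, 5.1, 3.8, 4.9, 5.3, 4.4, 4.7, 5.0]
n = len(data)

mu_hat = sum(data) / n                              # MLE of the mean
var_hat = sum((x - mu_hat) ** 2 for x in data) / n  # MLE of the variance (1/n)

print(mu_hat, var_hat)
```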
Explain the concept of Stratified Sampling and why it is crucial in classification problems with imbalanced datasets.
Stratified Sampling:
A sampling method where the population is divided into homogeneous subgroups called strata, and random samples are then drawn from each stratum independently.
Importance in Imbalanced Datasets:
Imagine a dataset with 95% Class A and 5% Class B.
- If we use Simple Random Sampling, we might accidentally select a sample that contains only Class A, leaving the model unable to learn anything about Class B.
- Stratified Sampling ensures that the proportion of Class A and Class B in the training/test sets matches the proportion in the original population (e.g., forcing exactly 5% of the sample to be Class B).
- This guarantees that minority classes are represented, preventing bias towards the majority class.
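A sketch using scikit-learn (assumed installed); `train_test_split` with the `stratify` argument implements exactly this proportion-preserving split:

```python
from collections import Counter
from sklearn.model_selection import train_test_split

# Imbalanced toy labels: 95 of class "A", 5 of class "B"
X = [[i] for i in range(100)]
y = ["A"] * 95 + ["B"] * 5

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

print(Counter(y_tr))  # class ratio preserved: 76 A, 4 B
print(Counter(y_te))  # 19 A, 1 B
```

Without `stratify=y`, a 20% test split of only 5 minority examples could easily contain zero of them.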
Why do we use Log-Likelihood instead of Likelihood in Maximum Likelihood Estimation (MLE)? Give two reasons.
We prefer Log-Likelihood ($\log L(\theta)$) over the raw Likelihood ($L(\theta)$) for the following reasons:
1. Numerical Stability:
The likelihood is the product of many probabilities ($L = \prod_i p_i$). Since probabilities are $\le 1$, multiplying many of them results in extremely small numbers that can cause floating-point underflow in computers (rounding to 0). Logarithms convert this product into a sum ($\log L = \sum_i \log p_i$), which stays within a manageable numeric range.
2. Ease of Differentiation:
To find the maximum, we calculate derivatives. Differentiating a long product (using the product rule) is mathematically complex and computationally expensive. Differentiation is linear over sums, so the log form is much simpler to solve analytically or computationally.
What are the limitations of Standard Gradient Descent (Batch GD) that led to the development of optimizers like SGD, Momentum, and Adam?
Limitations of Batch Gradient Descent:
- Computational Efficiency: Batch GD requires calculating the gradient for the entire dataset before making a single update step. For millions of data points, this is extremely slow and memory-intensive.
- Local Minima & Saddle Points: Without noise, Batch GD follows the exact gradient. In non-convex surfaces, it can easily get stuck in a local minimum or a saddle point and stop learning.
- Fixed Learning Rate: It uses a global learning rate for all parameters. If the data is sparse or features have different frequencies, a single learning rate is inefficient (too slow for some parameters, too fast for others).
- No Oscillation Dampening: In areas like ravines (steep in one direction, flat in another), standard GD tends to oscillate across the slopes rather than moving down the valley floor efficiently.