Unit5 - Subjective Questions
CSE273 • Practice Questions with Detailed Answers
Define the terms Population and Sample in the context of machine learning statistics. Briefly explain two common sampling techniques.
Population vs. Sample:
- Population: The entire group of data points or individuals that you are interested in studying. In ML, this represents the theoretical distribution of all possible data (e.g., all images of cats in existence).
- Sample: A subset of the population selected for analysis. In ML, the training and test datasets are samples from the real-world population.
Sampling Techniques:
- Simple Random Sampling: Every member of the population has an equal probability of being selected. This reduces selection bias.
- Stratified Sampling: The population is divided into subgroups (strata) based on shared characteristics (e.g., classes in a classification problem), and samples are taken from each stratum to ensure representation.
Explain the concept of Hypothesis Testing. Describe the roles of the Null Hypothesis ($H_0$), Alternative Hypothesis ($H_1$), and the p-value.
Hypothesis Testing is a statistical method used to make inferences or decisions about a population parameter based on sample data.
- Null Hypothesis ($H_0$): The default assumption that there is no effect, no relationship, or no difference between groups. Typically, we try to find evidence to reject this.
- Alternative Hypothesis ($H_1$ or $H_a$): The statement that contradicts the null hypothesis, representing the effect or difference we wish to demonstrate.
- p-value: The probability of observing test results at least as extreme as the results actually observed, assuming that the null hypothesis is true.
- If p-value $\le \alpha$ (significance level, typically 0.05), we reject $H_0$.
- If p-value $> \alpha$, we fail to reject $H_0$.
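As a concrete illustration (a minimal sketch assuming SciPy is available; the sample values are made up), a one-sample t-test follows exactly this reject / fail-to-reject logic:

```python
# Hypothetical data: does this sample's mean differ from 0?
from scipy import stats

sample = [2.1, 2.5, 1.8, 2.9, 2.3, 2.7, 2.0, 2.6]
t_stat, p_value = stats.ttest_1samp(sample, popmean=0.0)

alpha = 0.05  # significance level
if p_value <= alpha:
    decision = "Reject H0"          # evidence against the null
else:
    decision = "Fail to reject H0"
print(decision, round(p_value, 6))
```

Here the sample mean is clearly far from 0, so the p-value is tiny and we reject $H_0$.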
Derive the Maximum Likelihood Estimator (MLE) for the parameter $\theta$ (probability of heads) in a Bernoulli distribution (coin flip) given a sequence of outcomes.
Consider a Bernoulli distribution where $P(X = 1) = \theta$ and $P(X = 0) = 1 - \theta$.
1. Likelihood Function:
For $n$ independent trials with outcomes $x_1, \dots, x_n \in \{0, 1\}$:
$$L(\theta) = \prod_{i=1}^{n} \theta^{x_i} (1-\theta)^{1-x_i}$$
2. Log-Likelihood:
Taking the log makes differentiation easier:
$$\ell(\theta) = \sum_{i=1}^{n} \left[ x_i \log\theta + (1-x_i)\log(1-\theta) \right]$$
Let $k = \sum_{i=1}^{n} x_i$ (total heads). Then $\ell(\theta) = k\log\theta + (n-k)\log(1-\theta)$.
3. Differentiate w.r.t. $\theta$ and set to 0:
$$\frac{d\ell}{d\theta} = \frac{k}{\theta} - \frac{n-k}{1-\theta} = 0$$
4. Solve for $\theta$:
$$k(1-\theta) = (n-k)\theta \implies \hat{\theta} = \frac{k}{n}$$
The MLE is simply the sample mean (proportion of heads).
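The closed-form result can be sanity-checked numerically. This sketch (with illustrative flip data) compares $\hat{\theta} = k/n$ against a brute-force grid search over the log-likelihood:

```python
import math

flips = [1, 1, 0, 1, 0, 1, 1, 0, 1, 1]  # 1 = heads (illustrative data)
n, k = len(flips), sum(flips)

closed_form = k / n  # the derived MLE: proportion of heads

def log_likelihood(theta):
    # ell(theta) = k*log(theta) + (n-k)*log(1-theta)
    return k * math.log(theta) + (n - k) * math.log(1 - theta)

# Brute-force search over a fine grid in (0, 1)
grid = [i / 1000 for i in range(1, 1000)]
best = max(grid, key=log_likelihood)

print(closed_form, best)  # both 0.7
```

The grid maximizer lands exactly on the sample proportion, as the derivation predicts.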
Differentiate between Type I and Type II errors in statistical hypothesis testing.
Type I Error (False Positive):
- Definition: Occurs when the Null Hypothesis ($H_0$) is true, but we incorrectly reject it.
- Probability: Denoted by $\alpha$ (significance level).
- Example: Diagnosing a healthy patient with a disease.
Type II Error (False Negative):
- Definition: Occurs when the Null Hypothesis ($H_0$) is false (the Alternative is true), but we fail to reject $H_0$.
- Probability: Denoted by $\beta$.
- Example: Failing to detect a disease in a sick patient.
| | $H_0$ is True | $H_0$ is False |
|---|---|---|
| Reject $H_0$ | Type I Error ($\alpha$) | Correct Decision |
| Fail to Reject $H_0$ | Correct Decision | Type II Error ($\beta$) |
What is a Confidence Interval? How is it interpreted in the context of estimating a population mean?
Confidence Interval (CI):
A range of values, derived from sample statistics, that is likely to contain the value of an unknown population parameter.
Formula (for large samples):
$$\bar{x} \pm z_{\alpha/2} \frac{\sigma}{\sqrt{n}}$$
Where $\bar{x}$ is the sample mean, $z_{\alpha/2}$ is the Z-score (e.g., 1.96 for 95%), $\sigma$ is the population standard deviation, and $n$ is the sample size.
Interpretation:
A 95% Confidence Interval means that if we were to take 100 different samples and compute a confidence interval for each, approximately 95 of those intervals would contain the true population mean. It does not mean there is a 95% probability the specific interval contains the mean (the parameter is fixed, the interval varies).
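A sketch of the computation with made-up measurements (standard library only; the sample standard deviation stands in for $\sigma$, as is common when $n$ is reasonably large):

```python
import math

data = [12.1, 11.8, 12.5, 12.0, 11.9, 12.3, 12.2, 11.7, 12.4, 12.1]
n = len(data)
mean = sum(data) / n
std = math.sqrt(sum((x - mean) ** 2 for x in data) / (n - 1))  # sample std

z = 1.96  # Z-score for 95% confidence
margin = z * std / math.sqrt(n)
print(f"95% CI: ({mean - margin:.3f}, {mean + margin:.3f})")
```

The interval is centered on the sample mean with half-width $z \cdot \sigma / \sqrt{n}$, so collecting more data (larger $n$) shrinks the interval.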
Compare Mean Squared Error (MSE) and Cross-Entropy Loss. When should each be used?
Mean Squared Error (MSE):
- Formula: $\text{MSE} = \frac{1}{n}\sum_{i=1}^{n} (y_i - \hat{y}_i)^2$
- Usage: Used primarily for Regression problems where the output is a continuous value.
- Properties: Penalizes large errors significantly (due to squaring). Assumes Gaussian noise in MLE context.
Cross-Entropy Loss (Log Loss):
- Formula (Binary): $L = -\frac{1}{n}\sum_{i=1}^{n} \left[ y_i \log\hat{y}_i + (1-y_i)\log(1-\hat{y}_i) \right]$
- Usage: Used primarily for Classification problems (Binary or Multi-class) where outputs are probabilities (0 to 1).
- Properties: Penalizes confident wrong predictions heavily. Convex for logistic regression.
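A minimal hand-written sketch (toy numbers) of both losses, including how cross-entropy punishes a confident wrong prediction:

```python
import math

def mse(y, y_hat):
    # Mean squared error for regression targets
    return sum((a - b) ** 2 for a, b in zip(y, y_hat)) / len(y)

def binary_cross_entropy(y, y_hat):
    # Binary cross-entropy; y_hat are predicted probabilities in (0, 1)
    return -sum(a * math.log(b) + (1 - a) * math.log(1 - b)
                for a, b in zip(y, y_hat)) / len(y)

print(mse([1.0, 2.0, 3.0], [1.1, 1.9, 3.2]))      # regression: small squared error
good = binary_cross_entropy([1, 0], [0.9, 0.1])    # confident and right
bad = binary_cross_entropy([1, 0], [0.01, 0.99])   # confident and wrong
print(good, bad)                                   # bad is far larger
```

The confident wrong prediction incurs a loss dozens of times larger than the correct one, which is exactly the property that makes cross-entropy suitable for probabilistic classifiers.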
Explain the difference between Convex and Non-convex functions. Why is this distinction important in optimization?
Convex Function:
- Definition: A function is convex if a line segment connecting any two points on the graph of the function lies above or on the graph.
- Minima: Has only one global minimum (no local minima). Gradient descent is guaranteed to converge to the global optimum.
- Example: $f(x) = x^2$, or the MSE loss of linear regression.
Non-convex Function:
- Definition: A function that is not convex; it has "hills and valleys."
- Minima: Can have multiple local minima, saddle points, and one global minimum.
- Example: Neural network loss landscapes.
Importance:
Optimization is much harder for non-convex functions because algorithms like Gradient Descent can get stuck in a local minimum or a saddle point rather than finding the best possible solution (global minimum).
Define Local Minima, Global Minima, and Saddle Points.
- Global Minimum: The point in the entire domain of the function where the function value is the lowest.
- Local Minimum: A point where the function value is lower than all valid surrounding points in a specific neighborhood, but not necessarily the lowest in the entire domain.
- Saddle Point: A point where the gradient is zero (stationary point), but it is not an extremum (neither a pure minimum nor maximum). In some directions, the function curves up (like a minimum), and in others, it curves down (like a maximum). Saddle points effectively slow down training in high-dimensional non-convex optimization.
Discuss the impact of the Learning Rate ($\eta$) on the convergence of Gradient Descent.
The learning rate $\eta$ determines the step size at each iteration while moving toward a minimum.
- Too Small:
- Pros: Precise; likely to reach the minimum eventually.
- Cons: Training is extremely slow. It may get stuck in local minima more easily.
- Too Large:
- Pros: Faster initial movement.
- Cons: Can overshoot the minimum. It may oscillate or even diverge (loss increases to infinity).
- Optimal:
- Ideally, the learning rate should decay over time (Learning Rate Scheduling) to take large steps initially and small, precise steps near the convergence point.
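These regimes are easy to see on the toy objective $J(\theta) = \theta^2$ (gradient $2\theta$); a sketch with three hand-picked rates:

```python
def run_gd(lr, steps=20, theta=5.0):
    # Gradient descent on J(theta) = theta^2, whose gradient is 2*theta
    for _ in range(steps):
        theta = theta - lr * 2 * theta
    return theta

print(run_gd(0.01))  # too small: barely moved toward the minimum at 0
print(run_gd(0.3))   # reasonable: essentially converged to 0
print(run_gd(1.1))   # too large: overshoots and diverges
```

Each update multiplies $\theta$ by $(1 - 2\eta)$, so convergence requires $|1 - 2\eta| < 1$; the third rate violates this and the iterates blow up.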
Derive the Gradient Descent update rule mathematically. What is the role of the gradient?
Goal: Minimize a cost function $J(\theta)$ parameterized by $\theta$.
Taylor Series Approximation:
Around a point $\theta$, the function can be approximated to first order as:
$$J(\theta + \Delta\theta) \approx J(\theta) + \nabla J(\theta)^T \Delta\theta$$
To decrease $J$, we want $J(\theta + \Delta\theta) < J(\theta)$, implying $\nabla J(\theta)^T \Delta\theta < 0$.
To maximize the decrease, we choose $\Delta\theta$ in the direction opposite to the gradient:
$$\Delta\theta = -\eta \nabla J(\theta)$$
Where $\eta$ is the learning rate (step size).
Update Rule:
$$\theta_{t+1} = \theta_t - \eta \nabla J(\theta_t)$$
Role of the Gradient:
The gradient vector $\nabla J(\theta)$ points in the direction of steepest ascent. Therefore, subtracting the gradient moves the parameters in the direction of steepest descent, minimizing the loss.
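The update rule in action, as a sketch on the one-parameter objective $J(\theta) = (\theta - 3)^2$ with analytic gradient $2(\theta - 3)$:

```python
def grad_J(theta):
    return 2 * (theta - 3)  # gradient of J(theta) = (theta - 3)^2

theta, eta = 0.0, 0.1
for _ in range(100):
    theta = theta - eta * grad_J(theta)  # theta_{t+1} = theta_t - eta * grad

print(theta)  # converges to the minimizer, theta = 3
```

Because the gradient is positive to the right of 3 and negative to the left, subtracting it always pushes $\theta$ toward the minimum.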
Compare Batch Gradient Descent, Stochastic Gradient Descent (SGD), and Mini-batch Gradient Descent.
1. Batch Gradient Descent:
- Mechanism: Uses the entire dataset to compute the gradient for one update step.
- Pros: Stable convergence; smooth error curve.
- Cons: Computationally very expensive for large datasets; requires large memory.
2. Stochastic Gradient Descent (SGD):
- Mechanism: Uses a single random training example $(x^{(i)}, y^{(i)})$ to compute the gradient and update parameters.
- Pros: Frequent updates; faster per iteration; can escape local minima due to noise.
- Cons: High variance in updates; objective function fluctuates heavily (noisy convergence).
3. Mini-batch Gradient Descent:
- Mechanism: Uses a small batch of samples (e.g., 32, 64) for each update.
- Pros: Best of both worlds—vectorization efficiency (like Batch) and faster convergence (like SGD). Standard in Deep Learning.
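One function can demonstrate all three variants, since the batch size is the only difference. A sketch on 1-D linear regression $y = 2x$ (illustrative data; `batch_size=len(xs)` gives Batch GD and `batch_size=1` gives SGD):

```python
import random

random.seed(0)
xs = [i / 10 for i in range(1, 101)]
ys = [2.0 * x for x in xs]  # true weight is 2

def train(batch_size, epochs=50, lr=0.01):
    w = 0.0
    data = list(zip(xs, ys))
    for _ in range(epochs):
        random.shuffle(data)
        for start in range(0, len(data), batch_size):
            batch = data[start:start + batch_size]
            # gradient of MSE w.r.t. w, averaged over the batch
            grad = sum(2 * (w * x - y) * x for x, y in batch) / len(batch)
            w -= lr * grad
    return w

print(train(batch_size=len(xs)))  # Batch GD: one smooth update per epoch
print(train(batch_size=1))        # SGD: noisy per-sample updates
print(train(batch_size=10))       # Mini-batch: the usual compromise
```

All three recover $w \approx 2$; what differs is the number of updates per epoch and the noise in each step.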
What is the Correlation Coefficient? How does it differ from Covariance?
Correlation Coefficient (Pearson's $r$):
A normalized statistical measure that quantifies the strength and direction of the linear relationship between two variables:
$$r = \frac{\text{Cov}(X, Y)}{\sigma_X \sigma_Y}$$
- Range: $[-1, 1]$.
- $r = +1$: Perfect positive linear relationship.
- $r = -1$: Perfect negative linear relationship.
- $r = 0$: No linear relationship.
Difference from Covariance:
- Covariance indicates the direction of the linear relationship (positive or negative) but is scale-dependent. If you multiply a variable by 100, the covariance also scales by 100, making its magnitude difficult to interpret.
- Correlation is the standardized version of covariance. It is unitless and scale-invariant, making it easier to compare relationships across different datasets.
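A standard-library sketch of the scale dependence (toy data): rescaling one variable changes the covariance but leaves the correlation untouched.

```python
import statistics

x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

def covariance(a, b):
    # Sample covariance (divides by n - 1)
    ma, mb = statistics.mean(a), statistics.mean(b)
    return sum((ai - ma) * (bi - mb) for ai, bi in zip(a, b)) / (len(a) - 1)

def correlation(a, b):
    # Pearson's r: covariance normalized by both standard deviations
    return covariance(a, b) / (statistics.stdev(a) * statistics.stdev(b))

x_scaled = [xi * 100 for xi in x]
print(covariance(x, y), covariance(x_scaled, y))    # covariance scales by 100
print(correlation(x, y), correlation(x_scaled, y))  # correlation is unchanged
```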
Explain the concept of Momentum in optimization. How does it help SGD?
Concept:
Momentum is a method that helps accelerate SGD in the relevant direction and dampens oscillations. It accumulates an exponentially decaying moving average of past gradients.
Update Rule:
Let $v_t$ be the velocity (accumulated gradient) at time $t$:
$$v_t = \gamma v_{t-1} + \eta \nabla J(\theta)$$
$$\theta = \theta - v_t$$
Where $\gamma$ (usually 0.9) is the momentum term.
How it helps:
- Ravines: In areas where the surface curves much more steeply in one dimension than in another (ravines), SGD oscillates. Momentum adds the history of past updates, cancelling out the oscillations and boosting the velocity in the direction of the minimum.
- It helps the optimizer "roll past" small local minima.
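A sketch of the velocity update on $J(\theta) = \theta^2$, with $\gamma = 0.9$ as in the text (the learning rate 0.01 is an assumed value for illustration):

```python
def momentum_gd(steps=200, eta=0.01, gamma=0.9, theta=5.0):
    v = 0.0  # velocity: decaying accumulation of past gradients
    for _ in range(steps):
        grad = 2 * theta            # gradient of theta^2
        v = gamma * v + eta * grad  # v_t = gamma * v_{t-1} + eta * grad
        theta = theta - v           # theta = theta - v_t
    return theta

print(momentum_gd())  # close to the minimum at 0
```

The iterates overshoot and oscillate briefly (the "heavy ball" effect) but the accumulated velocity carries them to the minimum faster than plain GD at the same rate.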
Describe the RMSProp optimizer. How does it address the issue of diminishing learning rates in Adagrad?
RMSProp (Root Mean Square Propagation):
RMSProp is an adaptive learning rate method designed to resolve Adagrad's radically diminishing learning rates.
Mechanism:
Instead of accumulating the sum of squared gradients from the beginning (like Adagrad), RMSProp uses an exponentially decaying average of squared gradients.
Equations:
- Compute gradient: $g_t = \nabla J(\theta_t)$
- Update moving average of squared gradients: $E[g^2]_t = \beta E[g^2]_{t-1} + (1-\beta) g_t^2$
- Update parameters: $\theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{E[g^2]_t} + \epsilon} g_t$
Benefit:
This limits the influence of early gradients. The denominator does not grow monotonically, allowing the learning rate to adapt to the recent magnitude of gradients (large gradients → smaller step, small gradients → larger step) without vanishing entirely.
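A sketch of the mechanism on $J(\theta) = \theta^2$; the hyperparameters ($\beta = 0.9$, $\eta = 0.01$, $\epsilon = 10^{-8}$) are assumed typical values, not taken from the text:

```python
import math

def rmsprop(steps=2000, eta=0.01, beta=0.9, eps=1e-8, theta=5.0):
    avg_sq = 0.0  # exponentially decaying average of squared gradients
    for _ in range(steps):
        g = 2 * theta                                # gradient of theta^2
        avg_sq = beta * avg_sq + (1 - beta) * g * g  # E[g^2]_t
        theta = theta - eta * g / (math.sqrt(avg_sq) + eps)
    return theta

print(rmsprop())  # settles in a small neighborhood of the minimum at 0
```

Note the normalized step size: dividing by $\sqrt{E[g^2]_t}$ makes each step roughly $\eta$ in magnitude regardless of the raw gradient scale, which is why the step never dies out the way Adagrad's can.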
Explain the Adam (Adaptive Moment Estimation) optimizer. Why is it considered one of the best choices for training neural networks?
Adam combines the advantages of Momentum (handling oscillations) and RMSProp (adaptive learning rates).
Mechanism:
It computes adaptive learning rates for each parameter by maintaining estimates of the first moment (mean) and the second moment (uncentered variance) of the gradients.
Steps:
- First Moment (Momentum): $m_t = \beta_1 m_{t-1} + (1-\beta_1) g_t$
- Second Moment (RMSProp): $v_t = \beta_2 v_{t-1} + (1-\beta_2) g_t^2$
- Bias Correction: Since $m_t$ and $v_t$ are initialized to 0, they are biased toward 0 initially, so we use $\hat{m}_t = \frac{m_t}{1-\beta_1^t}$ and $\hat{v}_t = \frac{v_t}{1-\beta_2^t}$.
- Update Rule: $\theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{\hat{v}_t} + \epsilon} \hat{m}_t$
Why it is effective:
- It adapts learning rates for every parameter.
- It handles sparse gradients well.
- It includes momentum to speed up convergence.
- Bias correction ensures stability at the start of training.
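A self-contained sketch of the four steps on $J(\theta) = \theta^2$; the defaults ($\beta_1 = 0.9$, $\beta_2 = 0.999$, $\epsilon = 10^{-8}$, $\eta = 0.01$) are the commonly used values, assumed here for illustration:

```python
import math

def adam(steps=5000, eta=0.01, b1=0.9, b2=0.999, eps=1e-8, theta=5.0):
    m = v = 0.0
    for t in range(1, steps + 1):
        g = 2 * theta                      # gradient of theta^2
        m = b1 * m + (1 - b1) * g          # first moment (mean)
        v = b2 * v + (1 - b2) * g * g      # second moment (uncentered variance)
        m_hat = m / (1 - b1 ** t)          # bias correction
        v_hat = v / (1 - b2 ** t)
        theta = theta - eta * m_hat / (math.sqrt(v_hat) + eps)
    return theta

print(adam())  # settles near the minimum at 0
```

Each parameter effectively gets its own step size ($\eta / \sqrt{\hat{v}_t}$), while $\hat{m}_t$ supplies the momentum-smoothed direction.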
What is the Maximum Likelihood Principle? Explain with an example why maximizing the likelihood is equivalent to minimizing the Negative Log-Likelihood (NLL).
Maximum Likelihood Principle:
A method for estimating the parameters of a statistical model. It selects the parameter values that maximize the probability (likelihood) of observing the given sample data.
Minimizing Negative Log-Likelihood:
- The likelihood is a product of probabilities (numbers in $[0, 1]$), leading to numerical underflow.
- Taking the logarithm turns products into sums: $\log \prod_i p_i = \sum_i \log p_i$, which is numerically stable.
- The log function is monotonically increasing, so maximizing $L(\theta)$ is equivalent to maximizing $\log L(\theta)$.
- Optimization algorithms (like Gradient Descent) are designed to minimize functions.
- Therefore, maximizing $\log L(\theta)$ is mathematically equivalent to minimizing $-\log L(\theta)$ (Negative Log-Likelihood).
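A sketch with toy Bernoulli data and a grid search, showing that the argmax of $L(\theta)$ and the argmin of the NLL coincide:

```python
import math

flips = [1, 1, 0, 1, 1, 0, 1, 1]  # illustrative outcomes: 6 heads of 8

def likelihood(theta):
    # L(theta): product of per-observation probabilities
    return math.prod(theta if x == 1 else 1 - theta for x in flips)

def nll(theta):
    # Negative log-likelihood: sum of -log probabilities
    return -sum(math.log(theta if x == 1 else 1 - theta) for x in flips)

grid = [i / 1000 for i in range(1, 1000)]
ml_theta = max(grid, key=likelihood)
nll_theta = min(grid, key=nll)
print(ml_theta, nll_theta)  # identical: 0.75
```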
Derive the Maximum Likelihood Estimators for the mean ($\mu$) and variance ($\sigma^2$) of a Gaussian (Normal) Distribution.
Likelihood Function:
For $n$ i.i.d. samples $x_1, \dots, x_n$ from $\mathcal{N}(\mu, \sigma^2)$:
$$L(\mu, \sigma^2) = \prod_{i=1}^{n} \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(x_i - \mu)^2}{2\sigma^2}\right)$$
Log-Likelihood:
$$\ell(\mu, \sigma^2) = -\frac{n}{2}\log(2\pi) - \frac{n}{2}\log\sigma^2 - \frac{1}{2\sigma^2}\sum_{i=1}^{n}(x_i - \mu)^2$$
1. MLE for $\mu$:
Differentiate w.r.t. $\mu$ and set to 0:
$$\frac{\partial \ell}{\partial \mu} = \frac{1}{\sigma^2}\sum_{i=1}^{n}(x_i - \mu) = 0 \implies \hat{\mu} = \frac{1}{n}\sum_{i=1}^{n} x_i$$
2. MLE for $\sigma^2$:
Let $v = \sigma^2$. Differentiate w.r.t. $v$:
$$\frac{\partial \ell}{\partial v} = -\frac{n}{2v} + \frac{1}{2v^2}\sum_{i=1}^{n}(x_i - \hat{\mu})^2 = 0$$
Multiply by $2v^2$:
$$-nv + \sum_{i=1}^{n}(x_i - \hat{\mu})^2 = 0 \implies \hat{\sigma}^2 = \frac{1}{n}\sum_{i=1}^{n}(x_i - \hat{\mu})^2$$
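The closed forms can be computed directly (illustrative sample; note the MLE variance divides by $n$, not $n-1$, so it is the biased estimator):

```python
data = [4.2, 5.1, 3.8, 4.9, 5.3, 4.4, 4.7, 5.0]
n = len(data)

mu_hat = sum(data) / n                              # MLE of the mean
var_hat = sum((x - mu_hat) ** 2 for x in data) / n  # MLE of the variance (1/n)

print(mu_hat, var_hat)
```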
Explain the concept of Stratified Sampling and why it is crucial in classification problems with imbalanced datasets.
Stratified Sampling:
A sampling method where the population is divided into homogeneous subgroups called strata, and random samples are then drawn from each stratum independently.
Importance in Imbalanced Datasets:
Imagine a dataset with 95% Class A and 5% Class B.
- If we use Simple Random Sampling, we might accidentally select a sample that contains only Class A, leaving the model unable to learn anything about Class B.
- Stratified Sampling ensures that the proportion of Class A and Class B in the training/test sets matches the proportion in the original population (e.g., forcing exactly 5% of the sample to be Class B).
- This guarantees that minority classes are represented, preventing bias towards the majority class.
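A sketch using scikit-learn (assumed installed); `train_test_split` with the `stratify` argument implements exactly this proportion-preserving split:

```python
from collections import Counter
from sklearn.model_selection import train_test_split

# Imbalanced toy labels: 95 of class "A", 5 of class "B"
X = [[i] for i in range(100)]
y = ["A"] * 95 + ["B"] * 5

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

print(Counter(y_tr))  # class ratio preserved: 76 A, 4 B
print(Counter(y_te))  # 19 A, 1 B
```

Without `stratify=y`, a 20% test split of only 5 minority examples could easily contain zero of them.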
Why do we use Log-Likelihood instead of Likelihood in Maximum Likelihood Estimation (MLE)? Give two reasons.
We prefer Log-Likelihood ($\log L(\theta)$) over the raw Likelihood ($L(\theta)$) for the following reasons:
1. Numerical Stability:
The likelihood is the product of many probabilities ($L = \prod_i p_i$). Since probabilities are $\le 1$, multiplying many of them results in extremely small numbers that can cause floating-point underflow in computers (rounding to 0). Logarithms convert this product into a sum ($\log L = \sum_i \log p_i$), which stays within a manageable numeric range.
2. Ease of Differentiation:
To find the maximum, we calculate derivatives. Differentiating a long product (using the product rule) is mathematically complex and computationally expensive. Differentiation is linear over sums, so the log form is much simpler to solve analytically or computationally.
What are the limitations of Standard Gradient Descent (Batch GD) that led to the development of optimizers like SGD, Momentum, and Adam?
Limitations of Batch Gradient Descent:
- Computational Efficiency: Batch GD requires calculating the gradient for the entire dataset before making a single update step. For millions of data points, this is extremely slow and memory-intensive.
- Local Minima & Saddle Points: Without noise, Batch GD follows the exact gradient. In non-convex surfaces, it can easily get stuck in a local minimum or a saddle point and stop learning.
- Fixed Learning Rate: It uses a global learning rate for all parameters. If the data is sparse or features have different frequencies, a single learning rate is inefficient (too slow for some parameters, too fast for others).
- No Oscillation Dampening: In areas like ravines (steep in one direction, flat in another), standard GD tends to oscillate across the slopes rather than moving down the valley floor efficiently.