1. In machine learning, what does a random variable represent?
Random variables in ML models
Easy
A. A variable whose value is a numerical outcome of a random phenomenon.
B. The name of the machine learning algorithm.
C. A fixed hyperparameter of a model, like the learning rate.
D. A variable that is always zero.
Correct Answer: A variable whose value is a numerical outcome of a random phenomenon.
Explanation:
A random variable is a way to map outcomes from a random process (like selecting a data point) to numbers. For example, the height of a randomly selected person is a random variable.
2. In a dataset for classifying handwritten digits (0-9), the digit's label is an example of a...
Random variables in ML models
Easy
A. Negative random variable
B. Discrete random variable
C. Continuous random variable
D. Constant variable
Correct Answer: Discrete random variable
Explanation:
The variable can only take on a finite set of specific values (0, 1, 2, ..., 9). This is the definition of a discrete random variable.
3. Which of the following is a clear example of a continuous random variable in a machine learning model?
Random variables in ML models
Easy
A. The number of 'likes' on a social media post.
B. Whether an email is 'spam' or 'not spam'.
C. The star rating (1 to 5) of a product.
D. The temperature in Celsius predicted by a weather model.
Correct Answer: The temperature in Celsius predicted by a weather model.
Explanation:
Temperature can take any value within a given range (e.g., 25.5°C, 25.51°C), making it continuous. The other options can only take on specific, distinct integer values.
4. A random variable provides a numerical summary of a...
Random variables in ML models
Easy
A. Dataset's size
B. Fixed algorithm
C. Model's complexity
D. Random outcome
Correct Answer: Random outcome
Explanation:
The core purpose of a random variable is to assign a number to each possible outcome of a random experiment or observation, making it easier to analyze mathematically.
5. Which probability distribution is commonly used to model binary outcomes, such as whether a customer will click on an ad or not?
Probability distributions used in learning algorithms
Easy
A. Poisson distribution
B. Gaussian (Normal) distribution
C. Bernoulli distribution
D. Exponential distribution
Correct Answer: Bernoulli distribution
Explanation:
The Bernoulli distribution models a single trial with two possible outcomes (success/failure, 1/0), which is perfect for binary classification tasks.
6. The Gaussian distribution, often called the 'bell curve', is primarily defined by which two parameters?
Probability distributions used in learning algorithms
Easy
A. Mean and variance
B. Minimum and maximum
C. Rate and time
D. Number of trials and probability of success
Correct Answer: Mean and variance
Explanation:
The mean ($\mu$) determines the center of the bell curve, and the variance ($\sigma^2$) determines its spread or width.
7. If you are modeling the number of typographical errors on a page of a book, which discrete probability distribution is most appropriate?
Probability distributions used in learning algorithms
Easy
A. Gaussian distribution
B. Uniform distribution
C. Bernoulli distribution
D. Poisson distribution
Correct Answer: Poisson distribution
Explanation:
The Poisson distribution is ideal for modeling the number of times an event occurs in a fixed interval of time or space, such as errors on a page.
8. What does a probability distribution fundamentally describe?
Probability distributions used in learning algorithms
Easy
A. The exact value a variable will take.
B. The total number of data points in a set.
C. The probability of each possible outcome of a random variable.
D. The speed of a learning algorithm.
Correct Answer: The probability of each possible outcome of a random variable.
Explanation:
A probability distribution is a function that provides the probabilities of occurrence for all different possible outcomes of a random variable.
9. What is the primary goal of the Maximum Likelihood Estimation (MLE) method?
Likelihood estimation in ML context
Easy
A. To prove that a model is 100% correct.
B. To calculate the prior probability of the parameters.
C. To minimize the number of features in the dataset.
D. To find the model parameters that maximize the probability of observing the given data.
Correct Answer: To find the model parameters that maximize the probability of observing the given data.
Explanation:
MLE works by finding the set of parameters for a model that makes the observed data 'most likely' or 'most probable' to have occurred.
10. The likelihood function is the probability of the observed data treated as a function of the...
Likelihood estimation in ML context
Easy
A. Prior probability
B. Parameters
C. Data
D. Loss function
Correct Answer: Parameters
Explanation:
The likelihood, $L(\theta) = P(D \mid \theta)$, is the probability of the data given the parameters. When we view this as a function of $\theta$ for fixed data $D$, it is called the likelihood function.
11. Why is the log-likelihood often maximized instead of the likelihood itself?
Likelihood estimation in ML context
Easy
A. It is the only way to handle data with negative values.
B. It is a requirement for all regression models.
C. It gives a completely different and better result.
D. It is mathematically more convenient, turning products into sums.
Correct Answer: It is mathematically more convenient, turning products into sums.
Explanation:
The logarithm is a monotonic function, so maximizing $\log L(\theta)$ is the same as maximizing $L(\theta)$. The log converts products of probabilities into sums, which are easier to differentiate and numerically more stable.
12. In a linear regression model, assuming the errors are normally distributed, MLE is equivalent to minimizing which loss function?
Likelihood estimation in ML context
Easy
A. Absolute Error Loss
B. Logistic Loss
C. Hinge Loss
D. Squared Error Loss
Correct Answer: Squared Error Loss
Explanation:
Maximizing the likelihood of the data under the assumption of Gaussian noise leads to the same solution as minimizing the sum of squared errors. This provides a probabilistic justification for using least squares.
13. The Mean Squared Error (MSE) loss function is most suitable for which type of machine learning problem?
Loss functions: squared error, logistic loss
Easy
A. Reinforcement Learning
B. Clustering
C. Classification
D. Regression
Correct Answer: Regression
Explanation:
MSE is used in regression tasks where the goal is to predict a continuous numerical value, as it measures the average squared difference between predicted and actual values.
14. If a model predicts a house price of $300,000 and the actual price is $310,000, what is the squared error for this single prediction?
Loss functions: squared error, logistic loss
Easy
A. $300,000$
B.
C. $10,000$
D. $100,000,000$
Correct Answer: $100,000,000$
Explanation:
The squared error is calculated as $(y - \hat{y})^2$. Here, it is $(310{,}000 - 300{,}000)^2 = 10{,}000^2 = 100{,}000{,}000$.
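The arithmetic in the explanation can be checked directly. A minimal sketch in Python (the `squared_error` helper is illustrative, not part of any particular library):

```python
def squared_error(y_true: float, y_pred: float) -> float:
    """Squared error for a single prediction: (y_true - y_pred)**2."""
    return (y_true - y_pred) ** 2

# House-price example from the question: predicted $300,000, actual $310,000.
loss = squared_error(310_000, 300_000)
print(loss)  # 100000000
```

Note that the loss depends only on the difference between the two values, so the same error of $10,000 always yields the same squared loss, regardless of the absolute price level.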
15. Logistic Loss, also known as Binary Cross-Entropy, is primarily used in which type of task?
Loss functions: squared error, logistic loss
Easy
A. Linear Regression
B. Anomaly Detection
C. Time Series Forecasting
D. Binary Classification
Correct Answer: Binary Classification
Explanation:
Logistic Loss is designed for classification problems where the model outputs a probability (between 0 and 1) for a class. It penalizes confident but incorrect predictions heavily.
16. What is the fundamental role of a loss function during the training of a machine learning model?
Loss functions: squared error, logistic loss
Easy
A. To quantify the model's error, which the training process tries to minimize.
B. To count the number of data points in the training set.
C. To select which algorithm to use.
D. To preprocess and clean the input data.
Correct Answer: To quantify the model's error, which the training process tries to minimize.
Explanation:
A loss function provides a measure of how well the model's predictions match the true labels. The training algorithm (e.g., gradient descent) adjusts the model's parameters to make this loss value as small as possible.
17. In the Bayesian formula $P(\theta \mid D) = \frac{P(D \mid \theta)\,P(\theta)}{P(D)}$, what does the term $P(\theta)$ represent?
Bayesian interpretation of learning models
Easy
A. The Prior: our belief about the parameters before seeing data $D$.
B. The Evidence: the probability of the data.
C. The Likelihood: the probability of the data given the parameters.
D. The Posterior: our belief about the parameters after seeing data $D$.
Correct Answer: The Prior: our belief about the parameters before seeing data $D$.
Explanation:
$P(\theta)$ is the prior probability distribution. It captures our initial beliefs about what the model parameters might be before we have observed any evidence or data.
18. What is the 'posterior' probability in the context of Bayesian learning?
Bayesian interpretation of learning models
Easy
A. The initial guess about a hypothesis before seeing any data.
B. The updated probability of a hypothesis after considering new evidence.
C. The probability that the evidence is correct.
D. A type of loss function.
Correct Answer: The updated probability of a hypothesis after considering new evidence.
Explanation:
The posterior probability is the result of applying Bayes' theorem. It combines the prior belief with the likelihood of the new data to form an updated, more informed belief.
19. A key difference between the Bayesian and frequentist approaches is that the Bayesian approach treats model parameters as...
Bayesian interpretation of learning models
Easy
A. Unnecessary for prediction
B. Always positive numbers
C. Random variables
D. Fixed constants
Correct Answer: Random variables
Explanation:
In Bayesian statistics, parameters are considered random variables that have their own probability distributions (the prior and posterior). In contrast, frequentist statistics treats parameters as fixed, unknown constants.
20. What does Maximum a Posteriori (MAP) estimation do?
Bayesian interpretation of learning models
Easy
A. It finds the parameter values that maximize only the likelihood.
B. It finds the parameter values that maximize the posterior probability.
C. It calculates the mean of the posterior distribution.
D. It finds the parameter values that maximize only the prior.
Correct Answer: It finds the parameter values that maximize the posterior probability.
Explanation:
MAP is a Bayesian alternative to MLE. Instead of just maximizing the likelihood $P(D \mid \theta)$, it maximizes the posterior probability $P(\theta \mid D)$, which incorporates information from the prior belief $P(\theta)$.
21. In a standard linear regression model $y = X\beta + \epsilon$, which components are treated as random variables from a modeling perspective?
Random variables in ML models
Medium
A. Only the target variable $y$
B. The feature matrix $X$ and the parameters $\beta$
C. The target variable $y$ and the error term $\epsilon$
D. Only the parameters $\beta$
Correct Answer: The target variable $y$ and the error term $\epsilon$
Explanation:
The error term $\epsilon$ is explicitly modeled as a random variable (e.g., drawn from a Gaussian distribution). Because the target $y$ is a function of $\epsilon$, it is also a random variable. The features $X$ are often treated as fixed, and the parameters $\beta$ are unknown constants that we aim to estimate.
22. Which probability distribution is most suitable for modeling the class labels in a multi-class classification problem with $K$ classes for a single data point?
Probability distributions used in learning algorithms
Medium
A. Categorical distribution
B. Poisson distribution
C. Binomial distribution
D. Gaussian distribution
Correct Answer: Categorical distribution
Explanation:
The Categorical distribution is a generalization of the Bernoulli distribution for a single trial with $K$ possible outcomes. It is the natural choice for modeling a single data point's label in a multi-class classification scenario.
23. Assuming a dataset $D = \{x_1, \dots, x_N\}$ of i.i.d. data points and a model that gives $p(x_i \mid \theta)$, the likelihood function $L(\theta)$ is defined as:
Likelihood estimation in ML context
Medium
A. $p(x_1, \dots, x_N \mid \theta)$ without the i.i.d. assumption
B. $\sum_{i=1}^{N} p(x_i \mid \theta)$
C. $\prod_{i=1}^{N} p(x_i \mid \theta)$
D. $\max_i \, p(x_i \mid \theta)$
Correct Answer: $\prod_{i=1}^{N} p(x_i \mid \theta)$
Explanation:
The likelihood of the parameters given the data is the probability of observing that data given the parameters. For independent and identically distributed (i.i.d.) data, this is the product of the probabilities of each individual data point: $L(\theta) = \prod_{i=1}^{N} p(x_i \mid \theta)$.
24. A binary classifier predicts a probability of $\hat{p} = 0.8$ for a data point whose true label is $y = 1$. What is the logistic loss (cross-entropy) for this single prediction?
Loss functions: squared error, logistic loss
Medium
A. $-\ln(0.8) \approx 0.223$
B. $-\ln(0.2) \approx 1.609$
C. $0.8$
D. $0.2$
Correct Answer: $-\ln(0.8) \approx 0.223$
Explanation:
The logistic loss for a single instance is given by $-[y \ln(\hat{p}) + (1 - y)\ln(1 - \hat{p})]$. Since the true label $y = 1$, the formula simplifies to $-\ln(\hat{p}) = -\ln(0.8) \approx 0.223$.
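The cross-entropy formula above is easy to verify numerically. A minimal sketch, assuming a predicted probability of 0.8 for a true label of 1 (the `log_loss` helper here is hypothetical, written out to mirror the formula rather than taken from a library):

```python
import math

def log_loss(y: int, p: float) -> float:
    """Binary cross-entropy for one example: -[y*ln(p) + (1-y)*ln(1-p)]."""
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))

loss = log_loss(y=1, p=0.8)  # with y = 1 this simplifies to -ln(0.8)
```

Comparing `log_loss(1, 0.8)` with `log_loss(1, 0.01)` also illustrates the point from question 15: the penalty grows without bound as a wrong prediction becomes more confident.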
25. In a Bayesian framework, what is the relationship between the posterior, prior, and likelihood?
Bayesian interpretation of learning models
Medium
A. Prior ∝ Likelihood × Posterior
B. Posterior = Likelihood + Prior
C. Likelihood ∝ Posterior × Prior
D. Posterior ∝ Likelihood × Prior
Correct Answer: Posterior ∝ Likelihood × Prior
Explanation:
This is the essence of Bayes' theorem applied to model parameters $\theta$ and data $D$: $P(\theta \mid D) = \frac{P(D \mid \theta)\,P(\theta)}{P(D)}$. Since $P(D)$ is a constant with respect to $\theta$, we can write $P(\theta \mid D) \propto P(D \mid \theta)\,P(\theta)$, which translates to: the posterior is proportional to the likelihood times the prior.
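The proportionality can be made concrete on a discrete parameter grid: multiply likelihood by prior pointwise, then normalize so the posterior sums to one. A minimal sketch under assumed toy numbers (two hypotheses for a coin's bias, one observed heads; the `posterior` helper is illustrative):

```python
def posterior(prior, likelihood):
    """Posterior ∝ likelihood × prior, normalized over a discrete hypothesis set."""
    unnorm = [l * p for l, p in zip(likelihood, prior)]
    evidence = sum(unnorm)  # P(D), the normalizing constant
    return [u / evidence for u in unnorm]

# Hypotheses: the coin is fair (P(heads)=0.5) or biased (P(heads)=0.9).
prior = [0.5, 0.5]
likelihood = [0.5, 0.9]  # P(observed heads | hypothesis)
post = posterior(prior, likelihood)
```

Observing heads shifts belief toward the biased hypothesis, exactly as the proportionality predicts; dividing by the evidence only rescales, never reorders, the hypotheses.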
26. You are building a model to predict the number of customer support emails your company will receive in a one-hour period. Which probability distribution is a common and suitable choice for this task?
Probability distributions used in learning algorithms
Medium
A. Gaussian distribution
B. Poisson distribution
C. Uniform distribution
D. Bernoulli distribution
Correct Answer: Poisson distribution
Explanation:
The Poisson distribution is used to model the number of events occurring within a fixed interval of time or space, given a known constant mean rate. This perfectly describes the scenario of counting emails per hour.
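The Poisson probability mass function is simple enough to write out directly. A minimal sketch, assuming a hypothetical rate of 4 emails per hour (the `poisson_pmf` helper mirrors the textbook formula rather than a library call):

```python
import math

def poisson_pmf(k: int, lam: float) -> float:
    """P(X = k) = lam^k * exp(-lam) / k! for a Poisson(lam) count variable."""
    return lam ** k * math.exp(-lam) / math.factorial(k)

# Assuming an average of 4 support emails per hour, the probability of
# receiving exactly 2 emails in the next hour:
p2 = poisson_pmf(2, 4.0)
```

Summing the pmf over all non-negative counts recovers (numerically) a total probability of 1, a quick sanity check that the formula is a valid distribution.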
27. Why is squared error loss generally not a good choice for classification problems?
Loss functions: squared error, logistic loss
Medium
A. Its derivative is always zero for classification.
B. It cannot be optimized using gradient descent.
C. It assumes the target variable is continuous.
D. It penalizes confident, incorrect predictions too little.
Correct Answer: It penalizes confident, incorrect predictions too little.
Explanation:
For a binary class label in {0, 1}, if the model predicts 0.1 for a true label of 1, the squared error is $(1 - 0.1)^2 = 0.81$. If it predicts -1 (very wrong), the error is only $(1 - (-1))^2 = 4$. In contrast, log loss would approach infinity as the prediction gets more confidently wrong, providing a much stronger penalty and a better learning signal.
28. Maximizing the log-likelihood is equivalent to maximizing the likelihood itself. What is the primary practical advantage of optimizing the log-likelihood?
Likelihood estimation in ML context
Medium
A. It converts products into sums, which are numerically more stable and analytically simpler.
B. It always results in a convex optimization problem.
C. It incorporates a prior belief about the parameters.
D. It is computationally faster to compute a single logarithm than a single product.
Correct Answer: It converts products into sums, which are numerically more stable and analytically simpler.
Explanation:
The likelihood function involves a product of probabilities, which can lead to numerical underflow for large datasets. The logarithm turns this product into a sum ($\log \prod_i p_i = \sum_i \log p_i$), which is much more stable and easier to differentiate for optimization.
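The underflow problem is easy to demonstrate. A minimal sketch, assuming a toy dataset of 100,000 points that each have probability 0.5 under the model:

```python
import math

# Each of 100,000 i.i.d. points has probability 0.5 under the model.
probs = [0.5] * 100_000

# The raw likelihood product underflows to exactly 0.0 in double precision,
# since 0.5**100000 is far below the smallest representable float.
product = 1.0
for p in probs:
    product *= p

# The log-likelihood sum stays perfectly finite: 100000 * ln(0.5).
log_likelihood = sum(math.log(p) for p in probs)
```

Once the product hits 0.0, all information is lost; the log-sum form preserves it and remains differentiable, which is why optimizers work with log-likelihoods in practice.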
29. When we say a machine learning model provides a 'probabilistic prediction', what does this imply about its output?
Random variables in ML models
Medium
A. The model uses a random number generator to make predictions.
B. The model's output is guaranteed to be correct with a certain probability.
C. The model's output is a probability distribution over possible outcomes, not just a single point estimate.
D. The model's parameters are updated randomly during training.
Correct Answer: The model's output is a probability distribution over possible outcomes, not just a single point estimate.
Explanation:
A probabilistic prediction treats the output as a random variable and provides its probability distribution. For example, instead of predicting a single class 'cat', it might output {cat: 0.9, dog: 0.08, other: 0.02}.
30. What is the primary conceptual difference between Maximum a Posteriori (MAP) and Maximum Likelihood Estimation (MLE)?
Bayesian interpretation of learning models
Medium
A. MAP incorporates a prior belief about the model parameters, while MLE does not.
B. MAP is a Bayesian method, while MLE is a frequentist method that cannot be interpreted probabilistically.
C. MLE is used for regression while MAP is used for classification.
D. MAP maximizes the probability of the data given the parameters, while MLE maximizes the posterior.
Correct Answer: MAP incorporates a prior belief about the model parameters, while MLE does not.
Explanation:
MAP estimation aims to find the parameters that maximize the posterior probability, $P(\theta \mid D)$, which is proportional to the likelihood $P(D \mid \theta)$ times the prior $P(\theta)$. MLE only maximizes the likelihood $P(D \mid \theta)$, implicitly assuming a uniform (uninformative) prior.
31. If we assume that the target variable $y_i$ in a regression problem follows a Gaussian distribution with mean $\hat{y}_i$ (the model's prediction) and constant variance $\sigma^2$, maximizing the likelihood of the model parameters is equivalent to minimizing which loss function?
Likelihood estimation in ML context
Medium
A. Sum of Absolute Errors: $\sum_i |y_i - \hat{y}_i|$
B. Sum of Squared Errors: $\sum_i (y_i - \hat{y}_i)^2$
C. Logistic Loss
D. Hinge Loss
Correct Answer: Sum of Squared Errors: $\sum_i (y_i - \hat{y}_i)^2$
Explanation:
The log-likelihood under the Gaussian assumption is $\log L = -\frac{N}{2}\log(2\pi\sigma^2) - \frac{1}{2\sigma^2}\sum_i (y_i - \hat{y}_i)^2$. To maximize this, we must minimize the term $\sum_i (y_i - \hat{y}_i)^2$, which is the sum of squared errors.
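The equivalence can be checked numerically: the Gaussian negative log-likelihood differs from the sum of squared errors only by a scale ($1/2\sigma^2$) and an additive constant, so the two criteria always rank candidate predictions the same way. A minimal sketch under assumed toy values (targets, variance, and the two candidate prediction vectors are all made up for illustration):

```python
import math

y_true = [1.0, 2.0, 3.0, 4.0]
sigma2 = 1.5  # assumed constant noise variance

def sse(preds):
    """Sum of squared errors against y_true."""
    return sum((y - p) ** 2 for y, p in zip(y_true, preds))

def neg_log_likelihood(preds):
    """Gaussian NLL: N/2 * log(2*pi*sigma^2) + SSE / (2*sigma^2)."""
    n = len(y_true)
    return n / 2 * math.log(2 * math.pi * sigma2) + sse(preds) / (2 * sigma2)

good = [1.1, 1.9, 3.2, 3.8]  # close to the targets
bad = [2.0, 1.0, 4.5, 2.0]   # far from the targets
# Lower SSE <=> lower NLL: minimizing one minimizes the other.
```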
32. In logistic regression, the sigmoid function $\sigma(z) = \frac{1}{1 + e^{-z}}$ is used to model the probability of the positive class. This probability is the parameter of which underlying probability distribution for the binary target variable?
Probability distributions used in learning algorithms
Medium
A. Bernoulli distribution
B. Binomial distribution
C. Categorical distribution
D. Gaussian distribution
Correct Answer: Bernoulli distribution
Explanation:
Logistic regression models the outcome of a single binary event (e.g., class 0 or 1). The Bernoulli distribution describes the probability of success (p) or failure (1-p) in a single trial, making it the perfect fit. The sigmoid function's output is used as the parameter for this distribution.
33. A regression model predicts a value of 150 for a data point with a true value of 100. Another model predicts 200 for a true value of 150. How do their squared error losses compare?
Loss functions: squared error, logistic loss
Medium
A. The loss is the same for both predictions.
B. The second prediction has a higher loss.
C. The comparison is impossible without knowing the model.
D. The first prediction has a higher loss.
Correct Answer: The loss is the same for both predictions.
Explanation:
The squared error loss depends only on the difference between the true and predicted values. In both cases, the absolute error is $|100 - 150| = 50$ and $|150 - 200| = 50$, so the squared error is $50^2 = 2500$ for both.
34. Performing MAP estimation with a Gaussian prior on the model weights is equivalent to performing MLE with which type of regularization?
Bayesian interpretation of learning models
Medium
A. L2 Regularization (Ridge)
B. No regularization
C. L1 Regularization (Lasso)
D. Dropout
Correct Answer: L2 Regularization (Ridge)
Explanation:
The log of a zero-mean Gaussian prior on the weights is proportional to $-\sum_j w_j^2$. When this is added to the log-likelihood term during MAP optimization, it becomes the L2 regularization penalty term. Thus, MAP with a Gaussian prior is equivalent to L2-regularized MLE.
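The correspondence is easy to verify for a concrete weight vector: the weight-dependent part of the negative log of a zero-mean Gaussian prior is exactly the L2 penalty divided by $2\tau^2$. A minimal sketch, assuming an isotropic prior with a hypothetical variance `tau2` and a made-up weight vector:

```python
import math

def neg_log_gaussian_prior(w, tau2=1.0):
    """-log of an isotropic zero-mean Gaussian prior over weights:
    sum(w_j^2) / (2*tau2) plus a constant that does not depend on w."""
    d = len(w)
    const = d / 2 * math.log(2 * math.pi * tau2)
    return sum(wj ** 2 for wj in w) / (2 * tau2) + const

w = [0.5, -1.0, 2.0]
l2_penalty = sum(wj ** 2 for wj in w)  # ||w||^2 = 5.25
# Subtracting the constant (the value at w = 0) leaves l2_penalty / (2*tau2).
```

The prior variance `tau2` plays the role of the inverse regularization strength: a tighter prior (small `tau2`) corresponds to a larger L2 penalty.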
35. In a classification model, the output for a given input is a vector of probabilities $(p_A, p_B, p_C)$ for classes A, B, and C respectively. This vector can be interpreted as the parameters of which random variable?
Random variables in ML models
Medium
A. A set of three independent Bernoulli random variables
B. A Gaussian random variable representing the prediction error
C. A Binomial random variable representing the count of correct predictions
D. A Categorical random variable representing the predicted class
Correct Answer: A Categorical random variable representing the predicted class
Explanation:
The output vector represents the probability mass function for a single trial with multiple possible outcomes (classes). This is precisely the definition of a Categorical random variable.
36. You have a biased coin where the probability of heads, $p$, is unknown. You flip it 5 times and observe the sequence H, T, H, H, T. What is the Maximum Likelihood Estimate (MLE) for $p$?
Likelihood estimation in ML context
Medium
A. It cannot be determined from this data.
B. $p = 0.6$
C. $p = 0.5$
D. $p = 0.4$
Correct Answer: $p = 0.6$
Explanation:
The likelihood function is $L(p) = p^3 (1-p)^2$. To maximize this, we can take the derivative with respect to $p$, set it to zero, and solve. The result is the sample proportion of heads, which is 3 heads out of 5 flips, or $p = 3/5 = 0.6$.
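Instead of calculus, the same answer can be found by brute force: evaluate the likelihood $L(p) = p^3(1-p)^2$ on a fine grid of candidate values and keep the maximizer. A minimal sketch:

```python
def likelihood(p: float) -> float:
    """L(p) = p^3 * (1-p)^2 for the observed sequence H, T, H, H, T."""
    return p ** 3 * (1 - p) ** 2

# Grid search over candidate values of p in [0, 1].
grid = [i / 1000 for i in range(1001)]
p_hat = max(grid, key=likelihood)
print(p_hat)  # 0.6
```

The grid maximizer agrees with the analytical result (the sample proportion of heads), which is a handy sanity check when a likelihood is too messy to differentiate by hand.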
37. Consider two regression models. Model A has a Root Mean Squared Error (RMSE) of 10. Model B has a Mean Absolute Error (MAE) of 10. What can we definitively conclude?
Loss functions: squared error, logistic loss
Medium
A. We cannot directly compare the models as they use different error metrics.
B. Model A is better than Model B.
C. Both models have the same predictive accuracy.
D. Model B is better than Model A.
Correct Answer: We cannot directly compare the models as they use different error metrics.
Explanation:
RMSE and MAE are on the same scale but penalize errors differently. RMSE (related to squared error) penalizes large errors more heavily than MAE. A model could have a lower MAE but a higher RMSE if it has many small errors and a few very large ones. Therefore, without more information, a direct comparison is not meaningful.
38. A key assumption of the Naive Bayes algorithm is that the features are conditionally independent given the class label. How does this assumption simplify the model's probability calculations?
Probability distributions used in learning algorithms
Medium
A. It allows the joint probability of the features to be calculated as the product of individual probabilities: $P(x_1, \dots, x_n \mid y) = \prod_i P(x_i \mid y)$.
B. It eliminates the need to calculate the likelihood.
C. It forces all features to follow a Gaussian distribution.
D. It ensures that the posterior probability is always greater than the prior.
Correct Answer: It allows the joint probability of the features to be calculated as the product of individual probabilities: $P(x_1, \dots, x_n \mid y) = \prod_i P(x_i \mid y)$.
Explanation:
Without the independence assumption, calculating the joint probability $P(x_1, \dots, x_n \mid y)$ would be very complex. The 'naive' assumption of conditional independence simplifies this to a product of individual feature probabilities, making the computation tractable.
39. Under what condition does the Maximum a Posteriori (MAP) estimate for a parameter become exactly the same as the Maximum Likelihood Estimate (MLE)?
Bayesian interpretation of learning models
Medium
A. When the prior distribution is a uniform distribution.
B. When the likelihood function is Gaussian.
C. When the dataset is very small.
D. When the posterior distribution is symmetric.
Correct Answer: When the prior distribution is a uniform distribution.
Explanation:
MAP maximizes $P(D \mid \theta)\,P(\theta)$. MLE maximizes $P(D \mid \theta)$. If the prior $P(\theta)$ is a uniform distribution, it is a constant, and maximizing $P(D \mid \theta)$ times a constant is the same as maximizing $P(D \mid \theta)$ alone. Therefore, MAP becomes equivalent to MLE.
40. The logistic loss function for binary classification is derived from which principle?
Loss functions: squared error, logistic loss
Medium
A. Minimizing the absolute error of the predicted probabilities.
B. Minimizing the squared distance between the prediction and the true label.
C. Finding the maximum margin hyperplane between classes.
D. Maximizing the likelihood of the data under a Bernoulli distribution assumption.
Correct Answer: Maximizing the likelihood of the data under a Bernoulli distribution assumption.
Explanation:
If you assume the binary target variable follows a Bernoulli distribution whose parameter is given by the sigmoid output of your model, then the negative log-likelihood of the entire dataset is exactly the logistic loss (or cross-entropy) function. Therefore, minimizing logistic loss is equivalent to finding the maximum likelihood estimate for the model's parameters.
41. In Maximum Likelihood Estimation (MLE), finding parameters that maximize the likelihood is equivalent to minimizing a specific Kullback-Leibler (KL) divergence. Given the empirical data distribution $\hat{p}_{\text{data}}$ and the model distribution $p_{\text{model}}$, which KL divergence is minimized?
Likelihood estimation in ML context
Hard
A. $D_{KL}(p_{\text{model}} \,\|\, \hat{p}_{\text{data}})$
B. $D_{KL}(p_{\text{model}} \,\|\, q)$, where $q$ is a standard normal prior
C. $D_{KL}(\hat{p}_{\text{data}} \,\|\, p_{\text{model}})$
D. The symmetrized KL divergence: $D_{KL}(\hat{p}_{\text{data}} \,\|\, p_{\text{model}}) + D_{KL}(p_{\text{model}} \,\|\, \hat{p}_{\text{data}})$
Correct Answer: $D_{KL}(\hat{p}_{\text{data}} \,\|\, p_{\text{model}})$
Explanation:
The KL divergence expands to $D_{KL}(\hat{p}_{\text{data}} \,\|\, p_{\text{model}}) = H(\hat{p}_{\text{data}}, p_{\text{model}}) - H(\hat{p}_{\text{data}})$, where the first term is the cross-entropy and the second is the entropy of the data distribution. Since $H(\hat{p}_{\text{data}})$ is constant with respect to the model parameters $\theta$, minimizing the KL divergence is equivalent to minimizing the cross-entropy, which in turn is equivalent to maximizing the log-likelihood.
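The decomposition of KL divergence into cross-entropy minus entropy can be checked numerically on small discrete distributions. A minimal sketch under two made-up three-outcome distributions:

```python
import math

p_data = [0.2, 0.5, 0.3]    # stand-in for the empirical data distribution
p_model = [0.25, 0.25, 0.5]  # stand-in for the model distribution

def kl(p, q):
    """D_KL(p || q) = sum_i p_i * log(p_i / q_i)."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

def cross_entropy(p, q):
    """H(p, q) = -sum_i p_i * log(q_i)."""
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q))

def entropy(p):
    """H(p) = -sum_i p_i * log(p_i)."""
    return -sum(pi * math.log(pi) for pi in p)

kl_val = kl(p_data, p_model)
# Identity: D_KL(p_data || p_model) = H(p_data, p_model) - H(p_data)
```

Since the entropy term does not involve the model at all, minimizing cross-entropy over model parameters is the same optimization as minimizing this KL divergence.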
42. Consider a binary classification problem using a sigmoid activation function $\sigma(z) = \frac{1}{1 + e^{-z}}$. While squared error loss, $(y - \sigma(z))^2$, can be used, logistic loss (binary cross-entropy) is strongly preferred. From an optimization standpoint, what is the primary deficiency of using squared error loss in this context?
Loss functions: squared error, logistic loss
Hard
A. It is an unbounded loss function, unlike logistic loss.
B. Its corresponding loss surface is non-convex, potentially having multiple local minima.
C. It is not differentiable with respect to the model weights.
D. It produces vanishingly small gradients for confidently misclassified samples, slowing learning.
Correct Answer: Its corresponding loss surface is non-convex, potentially having multiple local minima.
Explanation:
The logistic loss function is convex with respect to the model's weights, which guarantees that gradient descent will converge to the global minimum. The squared error loss, when combined with a sigmoid function, creates a non-convex optimization landscape. This means optimization algorithms can get stuck in local minima that are not optimal solutions, which is a significant practical problem.
43. In Maximum a Posteriori (MAP) estimation, the choice of prior distribution over the weights corresponds to a specific type of regularization. If the prior is a zero-mean Laplace distribution, $p(w_j) \propto \exp(-|w_j| / b)$, what form of regularization does this induce when minimizing the negative log-posterior?
Bayesian interpretation of learning models
Hard
A. L2 Regularization (Ridge)
B. L1 Regularization (Lasso)
C. Dropout
D. Elastic Net Regularization
Correct Answer: L1 Regularization (Lasso)
Explanation:
MAP estimation maximizes $P(D \mid \mathbf{w})\,P(\mathbf{w})$, which is equivalent to minimizing $-\log P(D \mid \mathbf{w}) - \log P(\mathbf{w})$. The first term is the negative log-likelihood (the unregularized loss). The second term, $-\log P(\mathbf{w})$, for a Laplace prior becomes $\frac{1}{b} \sum_j |w_j|$ plus a constant. This is precisely the L1 regularization penalty.
44. In topic modeling with Latent Dirichlet Allocation (LDA), a symmetric Dirichlet prior, $\mathrm{Dir}(\alpha)$, is placed on the per-document topic distributions. How does the hyperparameter $\alpha$ influence the characteristics of the topic mixtures learned for documents?
Probability distributions used in learning algorithms
Hard
A. Values of $\alpha > 1$ encourage dense, uniform-like mixtures, while $\alpha < 1$ encourages sparse topic mixtures.
B. $\alpha$ controls the number of topics in the model.
C. $\alpha$ only controls the variance of the topic distribution, not its sparsity.
D. Values of $\alpha < 1$ encourage sparse topic mixtures (few topics per document), while $\alpha > 1$ encourages dense, uniform-like mixtures.
Correct Answer: Values of $\alpha < 1$ encourage sparse topic mixtures (few topics per document), while $\alpha > 1$ encourages dense, uniform-like mixtures.
Explanation:
The Dirichlet distribution's hyperparameter $\alpha$ controls the concentration of the probability mass. When $\alpha < 1$, the mass is pushed towards the corners of the simplex, favoring solutions where one or a few components are large and the rest are near zero (sparsity). When $\alpha > 1$, the mass is concentrated in the center, favoring solutions where all components are roughly equal (a dense mixture).
45. In a Naive Bayes classifier, the 'naive' assumption concerns the conditional independence of feature random variables $X_1, \dots, X_n$ given the class random variable $Y$. Which mathematical statement correctly represents this core assumption?
Random variables in ML models
Hard
A. $P(X_1, \dots, X_n) = \prod_{i=1}^{n} P(X_i)$
B. $P(X_1, \dots, X_n \mid Y) = \prod_{i=1}^{n} P(X_i \mid Y)$
C. $P(Y \mid X_1, \dots, X_n) = \prod_{i=1}^{n} P(Y \mid X_i)$
D. $P(X_i \mid X_j) = P(X_i)$ for all $i \neq j$
Correct Answer: $P(X_1, \dots, X_n \mid Y) = \prod_{i=1}^{n} P(X_i \mid Y)$
Explanation:
The Naive Bayes assumption is that features are independent of each other given the class label. This allows the joint conditional probability of the features to be factored into a product of individual conditional probabilities, which drastically simplifies the model and the estimation of its parameters. This is distinct from assuming marginal independence.
46. Suppose your model is misspecified, e.g., you use MLE to fit a Gaussian model $p_\theta$ to data that was truly generated from a Laplace distribution $p_{\text{true}}$. In the limit of infinite data, the MLE parameters will converge to the parameters of the Gaussian that is 'closest' to the true Laplace distribution. What measure of 'closeness' does MLE implicitly minimize?
Likelihood estimation in ML context
Hard
A. The KL divergence $D_{KL}(p_{\text{true}} \,\|\, p_\theta)$
B. The Total Variation distance
C. The KL divergence $D_{KL}(p_\theta \,\|\, p_{\text{true}})$
D. The L2 distance between the probability density functions
Correct Answer: The KL divergence $D_{KL}(p_{\text{true}} \,\|\, p_\theta)$
Explanation:
It is a fundamental result in information theory that Maximum Likelihood Estimation is equivalent to minimizing the Kullback-Leibler (KL) divergence between the empirical distribution of the data and the model's distribution. As the amount of data goes to infinity, the empirical distribution converges to the true data-generating distribution. Therefore, MLE on a misspecified model finds the parameters that minimize $D_{KL}(p_{\text{true}} \,\|\, p_\theta)$.
47. Comparing Huber loss, Logistic loss, and Squared Error loss in a classification context (e.g., labels $y \in \{-1, +1\}$), how would you rank their robustness to outliers, from most robust to least robust? An outlier is a point with a large error, e.g., a mislabeled point far from the boundary.
Loss functions: squared error, logistic loss
Hard
A. All three have similar robustness as they are all convex.
B. Squared Error Loss > Logistic Loss > Huber Loss
C. Huber Loss > Logistic Loss > Squared Error Loss
D. Logistic Loss > Huber Loss > Squared Error Loss
Correct Answer: Huber Loss > Logistic Loss > Squared Error Loss
Explanation:
Robustness to outliers is inversely related to how severely the loss function penalizes large errors. Squared Error loss grows quadratically, making it extremely sensitive to outliers. Both Logistic loss and Huber loss grow linearly for large errors, making them more robust. Huber loss is explicitly designed for robustness by being quadratic for small errors and linear for large errors, generally making it the most robust of the three.
48. In Bayesian modeling, a prior distribution is 'conjugate' to a likelihood if the resulting posterior is in the same probability distribution family as the prior. What is the primary computational advantage of using a conjugate prior?
Bayesian interpretation of learning models
Hard
A. It yields a closed-form analytical expression for the posterior, avoiding the need for numerical approximation methods like MCMC.
B. It guarantees that the MAP estimate will be identical to the MLE.
C. It ensures the posterior distribution is symmetric and unimodal.
D. It simplifies the calculation of model gradients for optimization.
Correct Answer: It yields a closed-form analytical expression for the posterior, avoiding the need for numerical approximation methods like MCMC.
Explanation:
The main benefit of conjugacy is that the posterior distribution can be computed analytically. The parameters of the posterior are found by simple algebraic updates to the hyperparameters of the prior using the data's sufficient statistics. This avoids the need for computationally intensive numerical methods like Markov Chain Monte Carlo (MCMC) or Variational Inference, which are often required for non-conjugate models.
49The reparameterization trick is essential for training Variational Autoencoders (VAEs). For a latent variable z modeled by a diagonal Gaussian N(μ, diag(σ²)), how does this trick allow gradients to flow through the sampling step?
Probability distributions used in learning algorithms
Hard
A.By using the score function estimator (REINFORCE) to estimate the gradient of the stochastic node.
B.By analytically integrating out the random variable from the loss function.
C.By replacing the sampling step with the mode of the distribution, μ.
D.By expressing z as a deterministic function of the parameters and a parameter-free random variable: z = μ + σ ⊙ ε, where ε ~ N(0, I).
Correct Answer: By expressing z as a deterministic function of the parameters and a parameter-free random variable: z = μ + σ ⊙ ε, where ε ~ N(0, I).
Explanation:
The reparameterization trick reframes the sampling process. Instead of sampling z directly from a distribution whose parameters are outputs of a neural network (which is a stochastic operation), we sample ε from a fixed, simple distribution (e.g., N(0, I)) and then deterministically transform this sample using the network's outputted parameters (μ and σ). This makes the path from the parameters to the sample (and thus the final loss) fully differentiable, allowing for standard backpropagation.
Incorrect! Try again.
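A minimal numpy sketch of the trick (the encoder outputs μ = 1.5 and log σ² = log 0.25 are hypothetical stand-ins; in a real VAE they come from a network):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical encoder outputs for one latent dimension.
mu, log_var = 1.5, np.log(0.25)
sigma = np.exp(0.5 * log_var)          # sigma = 0.5

# All randomness lives in eps, which does not depend on mu or sigma,
# so dz/dmu = 1 and dz/dsigma = eps: the sample is differentiable
# in the parameters.
eps = rng.standard_normal(100_000)
z = mu + sigma * eps

print(z.mean(), z.std())   # approximately mu and sigma
```

The same transform is what autodiff frameworks backpropagate through; sampling z directly from N(μ, σ²) would have no gradient path to μ and σ.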
50Assume a linear regression model where the target variable is modeled as a deterministic function of the inputs plus Gaussian noise, y = wᵀx + ε, with ε ~ N(0, σ²). Maximizing the likelihood of this model with respect to the weights w is equivalent to minimizing which loss function?
Loss functions: squared error, logistic loss
Hard
A.Hinge Loss
B.Log-Cosh Loss
C.Mean Squared Error (MSE)
D.Mean Absolute Error (MAE)
Correct Answer: Mean Squared Error (MSE)
Explanation:
The probability of observing y given x is p(y|x) = N(y; wᵀx, σ²). The negative log-likelihood of the dataset is (N/2) log(2πσ²) + (1/(2σ²)) Σᵢ (yᵢ − wᵀxᵢ)². To maximize the likelihood, we must minimize this expression. Since (N/2) log(2πσ²) is a constant, this is equivalent to minimizing the sum of squared errors, Σᵢ (yᵢ − wᵀxᵢ)².
Incorrect! Try again.
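A quick numerical check of the equivalence, on synthetic one-weight data (σ assumed known; a coarse grid search stands in for a proper optimizer, since only the location of the minimum matters):

```python
import numpy as np

rng = np.random.default_rng(1)
n, sigma = 50, 2.0
x = rng.normal(size=n)
y = 3.0 * x + rng.normal(scale=sigma, size=n)   # true weight = 3.0

def gaussian_nll(w):
    # Negative log-likelihood: constant + SSE / (2 sigma^2).
    resid = y - w * x
    return 0.5 * n * np.log(2 * np.pi * sigma**2) + np.sum(resid**2) / (2 * sigma**2)

def sse(w):
    # Sum of squared errors.
    return np.sum((y - w * x) ** 2)

# The two objectives differ only by an additive constant and a positive
# scale, so they are minimized at the same w.
ws = np.linspace(0, 6, 601)
print(ws[np.argmin([gaussian_nll(w) for w in ws])],
      ws[np.argmin([sse(w) for w in ws])])
```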
51Given n i.i.d. samples x₁, …, xₙ from an Exponential distribution with PDF p(x; λ) = λe^(−λx) for x ≥ 0, what is the Maximum Likelihood Estimate (MLE) for the rate parameter λ?
Likelihood estimation in ML context
Hard
A.The reciprocal of the sample variance
B.The sample mean, x̄
C.The reciprocal of the sample mean, 1/x̄
D.The sample variance
Correct Answer: The reciprocal of the sample mean, 1/x̄
Explanation:
The log-likelihood is ℓ(λ) = n log λ − λ Σᵢ xᵢ. Taking the derivative with respect to λ and setting it to zero gives n/λ − Σᵢ xᵢ = 0. Solving for λ yields λ̂ = n / Σᵢ xᵢ = 1/x̄.
Incorrect! Try again.
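The closed form is easy to sanity-check by simulation (the true rate of 2.0 is an arbitrary choice for illustration):

```python
import numpy as np

rng = np.random.default_rng(42)
true_rate = 2.0

# numpy parameterizes the Exponential by its scale, which is 1 / rate.
x = rng.exponential(scale=1 / true_rate, size=200_000)

lam_hat = 1 / x.mean()   # closed-form MLE: n / sum(x) = 1 / x-bar
print(lam_hat)           # close to 2.0
```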
52The Maximum A Posteriori (MAP) estimate incorporates a prior belief about the parameters, while the Maximum Likelihood Estimate (MLE) does not. Under which specific condition on the prior distribution do the MAP and MLE estimates become identical?
Bayesian interpretation of learning models
Hard
A.When the dataset size approaches infinity, for any valid prior.
B.When the prior is a uniform distribution over the parameter space.
C.When the posterior distribution is Gaussian.
D.When the likelihood function is from the exponential family.
Correct Answer: When the prior is a uniform distribution over the parameter space.
Explanation:
MAP estimation maximizes p(D|θ)p(θ), while MLE maximizes only the likelihood p(D|θ). The two objectives are equivalent if p(θ) is constant with respect to θ. This is precisely the case for a uniform prior, which assigns equal probability to all parameter values. While the estimates often converge as data size increases (Option A), they are only strictly identical for any dataset size when the prior is uniform.
Incorrect! Try again.
53A 2D random variable (X₁, X₂) follows a multivariate Gaussian distribution whose covariance matrix has off-diagonal entries Σ₁₂ = Σ₂₁ = 2. Which statement about the relationship between the random variables X₁ and X₂ is correct?
Probability distributions used in learning algorithms
Hard
A.X₁ and X₂ are positively correlated and are not independent.
B.X₁ and X₂ are independent because the distribution is Gaussian.
C.The marginal distribution of X₁ is not Gaussian.
D.X₁ and X₂ are negatively correlated.
Correct Answer: X₁ and X₂ are positively correlated and are not independent.
Explanation:
For a multivariate Gaussian distribution, independence between components is equivalent to zero covariance. The off-diagonal element of the covariance matrix, Σ₁₂, is 2. Since this is non-zero, X₁ and X₂ are not independent. Since it is positive, they are positively correlated. A key property of multivariate Gaussians is that their marginals are also Gaussian.
Incorrect! Try again.
54The bias of an estimator f̂ for a true function f is defined as Bias[f̂(x)] = E_D[f̂(x)] − f(x), where the expectation is over all possible training datasets D. What is the direct implication of an estimator being 'unbiased'?
Random variables in ML models
Hard
A.The estimator's prediction for any single training dataset is equal to the true value.
B.The average of the estimator's predictions over all possible training datasets is equal to the true value.
C.The estimator is guaranteed to have the lowest possible Mean Squared Error.
D.The estimator has zero variance.
Correct Answer: The average of the estimator's predictions over all possible training datasets is equal to the true value.
Explanation:
An unbiased estimator is one whose expected value is the true value of the quantity being estimated. This means it does not systematically over- or under-estimate. It does not imply that its prediction for any given dataset will be correct, only that its errors will average out to zero over the distribution of possible datasets.
Incorrect! Try again.
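The "average over datasets" idea can be simulated directly. The sketch below (sample sizes and the true variance are arbitrary choices) compares the biased divide-by-n variance estimator with the unbiased divide-by-(n−1) version over many simulated training sets:

```python
import numpy as np

rng = np.random.default_rng(7)
true_var = 4.0   # data drawn from N(0, 2^2)

# 200,000 independent "training datasets" of 5 points each.
datasets = rng.normal(scale=2.0, size=(200_000, 5))
biased   = datasets.var(axis=1, ddof=0)   # divides by n
unbiased = datasets.var(axis=1, ddof=1)   # divides by n - 1

# Averaged over datasets, only the ddof=1 estimator recovers true_var.
# On any single dataset, neither estimate is exactly right.
print(biased.mean(), unbiased.mean())
```

This illustrates the answer: unbiasedness is a statement about the average over the distribution of training sets, not about any individual estimate.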
55Minimizing the cross-entropy loss is a standard approach in classification. This is often described as being equivalent to minimizing the KL divergence KL(p ‖ q) between the true distribution p and the model's predicted distribution q. Why is this equivalence valid in the context of training a machine learning model?
Loss functions: squared error, logistic loss
Hard
A.The equivalence only holds if the model output is a Gaussian distribution.
B.Because KL divergence is simply another name for cross-entropy.
C.Because the entropy of the true data distribution, H(p), is a constant with respect to the model's parameters that determine q.
D.Because cross-entropy is symmetric, so H(p, q) = H(q, p), just like KL divergence.
Correct Answer: Because the entropy of the true data distribution, H(p), is a constant with respect to the model's parameters that determine q.
Explanation:
The KL divergence is defined as KL(p ‖ q) = H(p, q) − H(p), where H(p, q) is the cross-entropy and H(p) is the entropy of the true distribution p. During model training, the true labels (and thus p) are fixed. Therefore, H(p) is a constant. When optimizing with respect to model parameters, any constant term can be dropped, so minimizing KL(p ‖ q) is equivalent to minimizing just H(p, q).
Incorrect! Try again.
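The identity is easy to verify numerically for a small discrete distribution (the probability values below are arbitrary):

```python
import numpy as np

p = np.array([0.7, 0.2, 0.1])   # true label distribution (fixed)
q = np.array([0.5, 0.3, 0.2])   # model's predicted distribution

cross_entropy = -np.sum(p * np.log(q))   # H(p, q)
entropy_p     = -np.sum(p * np.log(p))   # H(p), constant w.r.t. the model
kl            =  np.sum(p * np.log(p / q))

# KL(p || q) = H(p, q) - H(p) holds exactly.
print(kl, cross_entropy - entropy_p)
```

Since H(p) never changes as the model's parameters move, the gradients of KL(p ‖ q) and H(p, q) are identical.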
56Early stopping is a regularization technique where training of an iterative algorithm (like a neural network) is halted when validation performance stops improving. What is the common Bayesian interpretation of this procedure?
Bayesian interpretation of learning models
Hard
A.It is an approximation of performing MAP estimation with a Gaussian prior on the weights, where the prior's variance is implicitly controlled by the stopping time.
B.It corresponds to maximizing the marginal likelihood (model evidence) instead of the posterior.
C.It has no valid Bayesian interpretation and is considered a purely heuristic method.
D.It is equivalent to using a Laplace (L1) prior for inducing sparsity in the weights.
Correct Answer: It is an approximation of performing MAP estimation with a Gaussian prior on the weights, where the prior's variance is implicitly controlled by the stopping time.
Explanation:
For many models, especially those optimized with gradient descent, stopping the training process early has a similar effect to L2 regularization (weight decay). L2 regularization, in turn, is equivalent to placing a zero-mean Gaussian prior on the weights in a MAP estimation framework. A shorter training time corresponds to a stronger regularization penalty, which is like using a Gaussian prior with a smaller variance, thus shrinking weights more aggressively towards zero.
Incorrect! Try again.
57The invariance property of Maximum Likelihood Estimators (MLEs) states that if θ̂ is the MLE for a parameter θ, and g is a function, then the MLE for the transformed parameter g(θ) is simply g(θ̂). Which of the following is a key condition for this property to hold?
Likelihood estimation in ML context
Hard
A.The property only holds if the likelihood belongs to the exponential family.
B.The property holds for any function g; it is not restricted to being one-to-one.
C.The property only holds if the function g is linear.
D.The property only holds if the function g is bijective (one-to-one and onto).
Correct Answer: The property holds for any function g; it is not restricted to being one-to-one.
Explanation:
The invariance principle is a very general property of MLEs. It states that the MLE of a transformed parameter is the transformation of the MLE of the original parameter. This holds for any function g, not just invertible or linear ones. For example, if σ̂ is the MLE for the standard deviation, then σ̂² is the MLE for the variance, even though the squaring function is not one-to-one over all real numbers.
Incorrect! Try again.
58A Poisson process describes the number of events occurring in a fixed time interval, with an average rate of λ events per unit time. What is the probability distribution that models the waiting time between consecutive events in this process?
Probability distributions used in learning algorithms
Hard
A.Poisson distribution with mean λ.
B.Exponential distribution with rate parameter λ.
C.Normal distribution with mean λ.
D.Gamma distribution with shape parameter 2.
Correct Answer: Exponential distribution with rate parameter λ.
Explanation:
There is a fundamental relationship between the Poisson and Exponential distributions. If the count of events in a time interval follows a Poisson distribution with rate λ, then the time interval between consecutive events (the inter-event waiting time) is described by an Exponential distribution with the same rate parameter λ. Its mean waiting time is 1/λ.
Incorrect! Try again.
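This can be checked by simulation without assuming the answer: conditional on the total event count, Poisson-process arrivals are i.i.d. uniform on the interval, so we can generate the process that way and inspect the gaps (the rate and time horizon below are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(3)
rate, T = 4.0, 50_000.0   # lambda events per unit time, horizon T

# Conditional on the count, Poisson-process arrival times are
# i.i.d. Uniform(0, T); this avoids generating exponential gaps
# directly, which would make the check circular.
n_events = rng.poisson(rate * T)
arrivals = np.sort(rng.uniform(0.0, T, size=n_events))
gaps = np.diff(arrivals)

print(gaps.mean())   # close to 1 / rate = 0.25
```

The empirical gap distribution also matches the Exponential(λ) density, not just its mean.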
59The Fisher Information Matrix, I(θ), quantifies the information a random variable carries about an unknown parameter θ. It has a deep connection to the geometry of the likelihood surface. What is its relationship to the Hessian matrix, H, of the negative log-likelihood function?
Loss functions: squared error, logistic loss
Hard
A.The Fisher Information is the expectation of the Hessian of the negative log-likelihood.
B.The Fisher Information is the determinant of the Hessian.
C.The Fisher Information is always the identity matrix when the Hessian is positive definite.
D.The Fisher Information is the inverse of the Hessian.
Correct Answer: The Fisher Information is the expectation of the Hessian of the negative log-likelihood.
Explanation:
A key result is that, under certain regularity conditions, the Fisher Information Matrix is equal to the expected value of the observed information. The observed information is the Hessian of the negative log-likelihood function. This means I(θ) = E[H], where H is the Hessian of −log p(x; θ). This connects the expected curvature of the loss surface to the amount of information in the data, and is foundational to methods like the natural gradient.
Incorrect! Try again.
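As a concrete one-parameter example (Bernoulli, chosen for illustration), the expected second derivative of the negative log-likelihood reproduces the known closed form 1/(p(1−p)):

```python
import numpy as np

# For a Bernoulli(p) observation x, the negative log-likelihood is
#   -x log p - (1 - x) log(1 - p),
# whose second derivative in p is x / p^2 + (1 - x) / (1 - p)^2.
p = 0.3

# Taking the expectation over x (using E[x] = p) gives the expected
# Hessian of the negative log-likelihood.
expected_hessian = p / p**2 + (1 - p) / (1 - p)**2

# Known closed-form Fisher information for Bernoulli(p).
fisher = 1 / (p * (1 - p))

print(expected_hessian, fisher)   # both equal 1/(p(1-p))
```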
60In the evidence framework for Bayesian model selection, one maximizes the marginal likelihood p(D | M) to choose a model M. How does this process naturally implement Occam's Razor?
Bayesian interpretation of learning models
Hard
A.It forces the posterior distribution to be unimodal, favoring simpler explanations.
B.It penalizes overly complex models because they must spread their predictive probability over a larger space of possible datasets, reducing the probability assigned to the observed data.
C.It integrates out nuisance parameters, which is equivalent to L0 regularization.
D.It sets the prior probability of complex models, p(M), to be exponentially lower.
Correct Answer: It penalizes overly complex models because they must spread their predictive probability over a larger space of possible datasets, reducing the probability assigned to the observed data.
Explanation:
The marginal likelihood, or evidence, automatically balances model fit and complexity. A simple model can only explain a small range of datasets well. A very complex model is flexible enough to explain many different datasets. This flexibility means it must spread its total probability of 1 over all these possibilities. Consequently, the probability it assigns to any single, specific dataset (the one we observed) is lower than that of a simpler model that fits the data just as well. Maximizing the evidence thus favors the simplest model that can adequately explain the data.