Unit 3 - Practice Quiz

INT255 60 Questions

1 In machine learning, what does a random variable represent?

Random variables in ML models Easy
A. A fixed hyperparameter of a model, like the learning rate.
B. The name of the machine learning algorithm.
C. A variable that is always zero.
D. A variable whose value is a numerical outcome of a random phenomenon.

2 In a dataset for classifying handwritten digits (0-9), the digit's label is an example of a...

Random variables in ML models Easy
A. Discrete random variable
B. Constant variable
C. Continuous random variable
D. Negative random variable

3 Which of the following is a clear example of a continuous random variable in a machine learning model?

Random variables in ML models Easy
A. The star rating (1 to 5) of a product.
B. The number of 'likes' on a social media post.
C. Whether an email is 'spam' or 'not spam'.
D. The temperature in Celsius predicted by a weather model.

4 A random variable provides a numerical summary of a...

Random variables in ML models Easy
A. Random outcome
B. Model's complexity
C. Dataset's size
D. Fixed algorithm

5 Which probability distribution is commonly used to model binary outcomes, such as whether a customer will click on an ad or not?

Probability distributions used in learning algorithms Easy
A. Bernoulli distribution
B. Exponential distribution
C. Gaussian (Normal) distribution
D. Poisson distribution

6 The Gaussian distribution, often called the 'bell curve', is primarily defined by which two parameters?

Probability distributions used in learning algorithms Easy
A. Rate and time
B. Minimum and maximum
C. Mean and variance
D. Number of trials and probability of success

7 If you are modeling the number of typographical errors on a page of a book, which discrete probability distribution is most appropriate?

Probability distributions used in learning algorithms Easy
A. Poisson distribution
B. Gaussian distribution
C. Bernoulli distribution
D. Uniform distribution

8 What does a probability distribution fundamentally describe?

Probability distributions used in learning algorithms Easy
A. The total number of data points in a set.
B. The exact value a variable will take.
C. The probability of each possible outcome of a random variable.
D. The speed of a learning algorithm.

9 What is the primary goal of the Maximum Likelihood Estimation (MLE) method?

Likelihood estimation in ML context Easy
A. To prove that a model is 100% correct.
B. To minimize the number of features in the dataset.
C. To find the model parameters that maximize the probability of observing the given data.
D. To calculate the prior probability of the parameters.

10 The likelihood function is the probability of the observed data treated as a function of the...

Likelihood estimation in ML context Easy
A. Data
B. Parameters
C. Prior probability
D. Loss function

11 Why is the log-likelihood often maximized instead of the likelihood itself?

Likelihood estimation in ML context Easy
A. It is the only way to handle data with negative values.
B. It gives a completely different and better result.
C. It is a requirement for all regression models.
D. It is mathematically more convenient, turning products into sums.

12 In a linear regression model, assuming the errors are normally distributed, MLE is equivalent to minimizing which loss function?

Likelihood estimation in ML context Easy
A. Hinge Loss
B. Squared Error Loss
C. Absolute Error Loss
D. Logistic Loss

13 The Mean Squared Error (MSE) loss function is most suitable for which type of machine learning problem?

Loss functions: squared error, logistic loss Easy
A. Regression
B. Reinforcement Learning
C. Classification
D. Clustering

14 If a model predicts a house price of $300,000 and the actual price is $310,000, what is the squared error for this single prediction?

Loss functions: squared error, logistic loss Easy
A. $300,000$
B.
C. $100,000,000$
D. $10,000$
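
To check the arithmetic in question 14, here is a minimal sketch that computes the squared error for a single prediction. The function name `squared_error` is an illustrative choice and is not taken from the course material.

```python
def squared_error(y_true: float, y_pred: float) -> float:
    """Squared difference between a true value and a prediction."""
    return (y_true - y_pred) ** 2

# Values from the question: predicted 300,000, actual 310,000.
loss = squared_error(310_000, 300_000)
print(loss)  # 100000000
```

Note that the result is expressed in squared units (dollars squared), which is one reason the root of the mean squared error is often reported instead.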

15 Logistic Loss, also known as Binary Cross-Entropy, is primarily used in which type of task?

Loss functions: squared error, logistic loss Easy
A. Anomaly Detection
B. Linear Regression
C. Binary Classification
D. Time Series Forecasting

16 What is the fundamental role of a loss function during the training of a machine learning model?

Loss functions: squared error, logistic loss Easy
A. To quantify the model's error, which the training process tries to minimize.
B. To select which algorithm to use.
C. To preprocess and clean the input data.
D. To count the number of data points in the training set.

17 In the Bayesian formula , what does the term represent?

Bayesian interpretation of learning models Easy
A. The Posterior: our belief about the parameters after seeing the data.
B. The Prior: our belief about the parameters before seeing the data.
C. The Evidence: the probability of the data.
D. The Likelihood: the probability of the data given the parameters.

18 What is the 'posterior' probability in the context of Bayesian learning?

Bayesian interpretation of learning models Easy
A. The updated probability of a hypothesis after considering new evidence.
B. The initial guess about a hypothesis before seeing any data.
C. The probability that the evidence is correct.
D. A type of loss function.

19 A key difference between the Bayesian and frequentist approaches is that the Bayesian approach treats model parameters as...

Bayesian interpretation of learning models Easy
A. Random variables
B. Fixed constants
C. Always positive numbers
D. Unnecessary for prediction

20 What does Maximum a Posteriori (MAP) estimation do?

Bayesian interpretation of learning models Easy
A. It finds the parameter values that maximize only the prior.
B. It finds the parameter values that maximize only the likelihood.
C. It finds the parameter values that maximize the posterior probability.
D. It calculates the mean of the posterior distribution.

21 In a standard linear regression model, which components are treated as random variables from a modeling perspective?

Random variables in ML models Medium
A. Only the target variable
B. The target variable and the error term
C. Only the parameters
D. The feature matrix and the parameters

22 Which probability distribution is most suitable for modeling the class label of a single data point in a multi-class classification problem?

Probability distributions used in learning algorithms Medium
A. Categorical distribution
B. Binomial distribution
C. Poisson distribution
D. Gaussian distribution

23 Assuming a dataset of i.i.d. data points and a model that gives , the likelihood function is defined as:

Likelihood estimation in ML context Medium
A.
B.
C.
D. without the i.i.d. assumption

24 A binary classifier predicts a probability of for a data point whose true label is . What is the logistic loss (cross-entropy) for this single prediction?

Loss functions: squared error, logistic loss Medium
A. $0.2$
B.
C.
D.

25 In a Bayesian framework, what is the relationship between the posterior, prior, and likelihood?

Bayesian interpretation of learning models Medium
A. Prior $\propto$ Likelihood $\times$ Posterior
B. Posterior $\propto$ Likelihood $\times$ Prior
C. Likelihood $\propto$ Posterior $\times$ Prior
D. Posterior $\propto$ Likelihood + Prior

26 You are building a model to predict the number of customer support emails your company will receive in a one-hour period. Which probability distribution is a common and suitable choice for this task?

Probability distributions used in learning algorithms Medium
A. Uniform distribution
B. Gaussian distribution
C. Poisson distribution
D. Bernoulli distribution

27 Why is squared error loss generally not a good choice for classification problems?

Loss functions: squared error, logistic loss Medium
A. It penalizes confident, incorrect predictions too little.
B. It cannot be optimized using gradient descent.
C. It assumes the target variable is continuous.
D. Its derivative is always zero for classification.

28 Maximizing the log-likelihood is equivalent to maximizing the likelihood itself. What is the primary practical advantage of optimizing the log-likelihood?

Likelihood estimation in ML context Medium
A. It converts products into sums, which are numerically more stable and analytically simpler.
B. It always results in a convex optimization problem.
C. It is computationally faster to compute a single logarithm than a single product.
D. It incorporates a prior belief about the parameters.

29 When we say a machine learning model provides a 'probabilistic prediction', what does this imply about its output?

Random variables in ML models Medium
A. The model's output is guaranteed to be correct with a certain probability.
B. The model's output is a probability distribution over possible outcomes, not just a single point estimate.
C. The model's parameters are updated randomly during training.
D. The model uses a random number generator to make predictions.

30 What is the primary conceptual difference between Maximum a Posteriori (MAP) and Maximum Likelihood Estimation (MLE)?

Bayesian interpretation of learning models Medium
A. MAP is a Bayesian method, while MLE is a frequentist method that cannot be interpreted probabilistically.
B. MAP maximizes the probability of the data given the parameters, while MLE maximizes the posterior.
C. MLE is used for regression while MAP is used for classification.
D. MAP incorporates a prior belief about the model parameters, while MLE does not.

31 If we assume that the target variable in a regression problem follows a Gaussian distribution whose mean is the model's prediction and whose variance is constant, maximizing the likelihood of the model parameters is equivalent to minimizing which loss function?

Likelihood estimation in ML context Medium
A. Sum of Squared Errors
B. Logistic Loss
C. Sum of Absolute Errors
D. Hinge Loss

32 In logistic regression, the sigmoid function is used to model the probability of the positive class. This probability is the parameter of which underlying probability distribution for the binary target variable?

Probability distributions used in learning algorithms Medium
A. Bernoulli distribution
B. Categorical distribution
C. Binomial distribution
D. Gaussian distribution

33 A regression model predicts a value of 150 for a data point with a true value of 100. Another model predicts 200 for a true value of 150. How do their squared error losses compare?

Loss functions: squared error, logistic loss Medium
A. The second prediction has a higher loss.
B. The comparison is impossible without knowing the model.
C. The loss is the same for both predictions.
D. The first prediction has a higher loss.

34 Performing MAP estimation with a Gaussian prior on the model weights is equivalent to performing MLE with which type of regularization?

Bayesian interpretation of learning models Medium
A. Dropout
B. No regularization
C. L1 Regularization (Lasso)
D. L2 Regularization (Ridge)
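
The link between a Gaussian prior and a penalty term in question 34 can be checked numerically. The sketch below is an illustration under stated assumptions: a one-weight linear model without intercept, Gaussian noise with variance `sigma2`, a zero-mean Gaussian prior with variance `tau2`, and made-up toy data. None of these names or values come from the quiz.

```python
# Toy data (hypothetical) for a model y ≈ w * x.
xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.0, 6.5]
sigma2 = 1.0   # assumed noise variance
tau2 = 0.5     # assumed prior variance on w

def neg_log_posterior(w: float) -> float:
    """Negative log-posterior up to an additive constant."""
    data_term = sum((y - w * x) ** 2 for x, y in zip(xs, ys)) / (2 * sigma2)
    prior_term = w ** 2 / (2 * tau2)
    return data_term + prior_term

# MAP estimate by brute-force grid search over w.
grid = [i / 10_000 for i in range(0, 30_001)]
w_map = min(grid, key=neg_log_posterior)

# Penalized least squares with a squared-weight penalty, lam = sigma2 / tau2:
# minimize sum((y - w x)^2) + lam * w^2, which has a closed form in 1D.
lam = sigma2 / tau2
w_penalized = sum(x * y for x, y in zip(xs, ys)) / (sum(x * x for x in xs) + lam)

print(w_map, w_penalized)
```

The two estimates agree up to the grid resolution, and a smaller prior variance `tau2` corresponds to a larger penalty strength `lam`.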

35 In a classification model, the output for a given input is a vector of three probabilities, one each for classes A, B, and C. This vector can be interpreted as the parameters of which random variable?

Random variables in ML models Medium
A. A set of three independent Bernoulli random variables
B. A Gaussian random variable representing the prediction error
C. A Binomial random variable representing the count of correct predictions
D. A Categorical random variable representing the predicted class

36 You have a biased coin whose probability of heads is unknown. You flip it 5 times and observe the sequence H, T, H, H, T. What is the Maximum Likelihood Estimate (MLE) for the probability of heads?

Likelihood estimation in ML context Medium
A. It cannot be determined from this data.
B.
C.
D.
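
For question 36, the estimate can be verified numerically. This is a minimal sketch that assumes independent flips with a fixed heads probability `p` (a Bernoulli model); the grid search and all variable names are illustrative.

```python
import math

flips = ["H", "T", "H", "H", "T"]  # sequence from the question
heads = flips.count("H")
tails = flips.count("T")

def log_likelihood(p: float) -> float:
    """Bernoulli log-likelihood of the observed flips for heads probability p."""
    return heads * math.log(p) + tails * math.log(1 - p)

# Coarse grid search over candidate values of p in (0, 1).
grid = [i / 1000 for i in range(1, 1000)]
p_grid = max(grid, key=log_likelihood)

# Closed-form MLE for a Bernoulli parameter: the observed fraction of heads.
p_closed = heads / len(flips)

print(p_grid, p_closed)
```

Both approaches should land on the same value, which is the fraction of heads in the observed sequence.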

37 Consider two regression models. Model A has a Root Mean Squared Error (RMSE) of 10. Model B has a Mean Absolute Error (MAE) of 10. What can we definitively conclude?

Loss functions: squared error, logistic loss Medium
A. Both models have the same predictive accuracy.
B. Model B is better than Model A.
C. Model A is better than Model B.
D. We cannot directly compare the models as they use different error metrics.

38 A key assumption of the Naive Bayes algorithm is that the features are conditionally independent given the class label. How does this assumption simplify the model's probability calculations?

Probability distributions used in learning algorithms Medium
A. It forces all features to follow a Gaussian distribution.
B. It allows the joint probability of the features to be calculated as the product of their individual probabilities.
C. It ensures that the posterior probability is always greater than the prior.
D. It eliminates the need to calculate the likelihood.

39 Under what condition does the Maximum a Posteriori (MAP) estimate for a parameter become exactly the same as the Maximum Likelihood Estimate (MLE)?

Bayesian interpretation of learning models Medium
A. When the posterior distribution is symmetric.
B. When the likelihood function is Gaussian.
C. When the dataset is very small.
D. When the prior distribution is a uniform distribution.

40 The logistic loss function for binary classification is derived from which principle?

Loss functions: squared error, logistic loss Medium
A. Maximizing the likelihood of the data under a Bernoulli distribution assumption.
B. Minimizing the absolute error of the predicted probabilities.
C. Minimizing the squared distance between the prediction and the true label.
D. Finding the maximum margin hyperplane between classes.

41 In Maximum Likelihood Estimation (MLE), finding parameters that maximize the likelihood is equivalent to minimizing a specific Kullback-Leibler (KL) divergence. Given the empirical data distribution and the model distribution , which KL divergence is minimized?

Likelihood estimation in ML context Hard
A. , where is a standard normal prior
B.
C.
D. The symmetrized KL Divergence:

42 Consider a binary classification problem using a sigmoid activation function. While squared error loss can be used, logistic loss (binary cross-entropy) is strongly preferred. From an optimization standpoint, what is the primary deficiency of using squared error loss in this context?

Loss functions: squared error, logistic loss Hard
A. It produces vanishingly small gradients for confidently misclassified samples, slowing learning.
B. It is an unbounded loss function, unlike logistic loss.
C. Its corresponding loss surface is non-convex, potentially having multiple local minima.
D. It is not differentiable with respect to the model weights.
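
The optimization behaviour asked about in question 42 can be explored with a small numerical sketch. It assumes a single pre-activation `z`, a sigmoid output, squared error written as `(a - y)**2`, and a confidently wrong example (`z = -10` with true label 1). These specifics are illustrative assumptions, not part of the question.

```python
import math

def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))

def grad_squared_error(z: float, y: float) -> float:
    """d/dz of (sigmoid(z) - y)^2, via the chain rule."""
    a = sigmoid(z)
    return 2.0 * (a - y) * a * (1.0 - a)

def grad_logistic(z: float, y: float) -> float:
    """d/dz of the binary cross-entropy with a sigmoid output."""
    return sigmoid(z) - y

z, y = -10.0, 1.0  # model is confident in class 0, true label is 1
print(grad_squared_error(z, y))
print(grad_logistic(z, y))
```

The squared error gradient carries the extra factor `a * (1 - a)`, which shrinks toward zero when the sigmoid saturates, while the logistic loss gradient stays close to the size of the error itself.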

43 In Maximum A Posteriori (MAP) estimation, the choice of prior distribution over the weights corresponds to a specific type of regularization. If the prior is a zero-mean Laplace distribution, what form of regularization does this induce when minimizing the negative log-posterior?

Bayesian interpretation of learning models Hard
A. L1 Regularization (Lasso)
B. Dropout
C. L2 Regularization (Ridge)
D. Elastic Net Regularization

44 In topic modeling with Latent Dirichlet Allocation (LDA), a symmetric Dirichlet prior, $\mathrm{Dir}(\alpha)$, is placed on the per-document topic distributions. How does the hyperparameter $\alpha$ influence the characteristics of the topic mixtures learned for documents?

Probability distributions used in learning algorithms Hard
A. $\alpha$ only controls the variance of the topic distribution, not its sparsity.
B. Values of $\alpha < 1$ encourage dense, uniform-like mixtures, while $\alpha > 1$ encourages sparse topic mixtures.
C. $\alpha$ controls the number of topics in the model.
D. Values of $\alpha < 1$ encourage sparse topic mixtures (few topics per document), while $\alpha > 1$ encourages dense, uniform-like mixtures.

45 In a Naive Bayes classifier, the 'naive' assumption concerns the conditional independence of feature random variables given the class random variable . Which mathematical statement correctly represents this core assumption?

Random variables in ML models Hard
A.
B.
C.
D.

46 Suppose your model is misspecified, e.g., you use MLE to fit a Gaussian model to data that was truly generated from a Laplace distribution. In the limit of infinite data, the MLE parameters will converge to the parameters of the Gaussian that is 'closest' to the true Laplace distribution. What measure of 'closeness' does MLE implicitly minimize?

Likelihood estimation in ML context Hard
A. The KL divergence
B. The KL divergence
C. The L2 distance between the probability density functions
D. The Total Variation distance

47 Comparing Huber loss, Logistic loss, and Squared Error loss in a classification context, how would you rank their robustness to outliers, from most robust to least robust? An outlier is a point with a large error, e.g., a mislabeled point far from the boundary.

Loss functions: squared error, logistic loss Hard
A. Logistic Loss > Huber Loss > Squared Error Loss
B. Squared Error Loss > Logistic Loss > Huber Loss
C. Huber Loss > Logistic Loss > Squared Error Loss
D. All three have similar robustness as they are all convex.

48 In Bayesian modeling, a prior distribution is 'conjugate' to a likelihood if the resulting posterior is in the same probability distribution family as the prior. What is the primary computational advantage of using a conjugate prior?

Bayesian interpretation of learning models Hard
A. It yields a closed-form analytical expression for the posterior, avoiding the need for numerical approximation methods like MCMC.
B. It simplifies the calculation of model gradients for optimization.
C. It guarantees that the MAP estimate will be identical to the MLE.
D. It ensures the posterior distribution is symmetric and unimodal.

49 The reparameterization trick is essential for training Variational Autoencoders (VAEs). For a latent variable $z$ modeled by a diagonal Gaussian $\mathcal{N}(\mu, \sigma^2)$, how does this trick allow gradients to flow through the sampling step?

Probability distributions used in learning algorithms Hard
A. By replacing the sampling step with the mode of the distribution, $z = \mu$.
B. By expressing $z$ as a deterministic function of parameters and a parameter-free random variable: $z = \mu + \sigma \odot \epsilon$, where $\epsilon \sim \mathcal{N}(0, I)$.
C. By using the score function estimator (REINFORCE) to estimate the gradient of the stochastic node.
D. By analytically integrating out the random variable from the loss function.
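
The sampling step discussed in question 49 can be illustrated without an autodiff library. The sketch below is a toy under stated assumptions: a one-dimensional Gaussian latent, a hypothetical objective `f(z) = z**2`, and hand-written chain-rule derivatives. A real VAE would use an automatic differentiation framework instead.

```python
import random

def reparam_gradients(mu: float, sigma: float, n_samples: int = 100_000, seed: int = 0):
    """Monte Carlo estimate of d/dmu and d/dsigma of E[z^2] with z = mu + sigma * eps."""
    rng = random.Random(seed)
    grad_mu = 0.0
    grad_sigma = 0.0
    for _ in range(n_samples):
        eps = rng.gauss(0.0, 1.0)  # noise that does not depend on mu or sigma
        z = mu + sigma * eps       # deterministic in (mu, sigma) once eps is drawn
        # f(z) = z^2 gives df/dz = 2z; chain rule with dz/dmu = 1 and dz/dsigma = eps.
        grad_mu += 2.0 * z
        grad_sigma += 2.0 * z * eps
    return grad_mu / n_samples, grad_sigma / n_samples

g_mu, g_sigma = reparam_gradients(mu=1.5, sigma=0.5)
# Analytically E[z^2] = mu^2 + sigma^2, so the estimates should be near 2*mu and 2*sigma.
print(g_mu, g_sigma)
```

Because the randomness lives entirely in `eps`, each sample of `z` is an ordinary differentiable function of `mu` and `sigma`, which is what lets gradients pass through.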

50 Assume a linear regression model in which the target variable is modeled as a deterministic function of the inputs plus Gaussian noise. Maximizing the likelihood of this model with respect to the weights is equivalent to minimizing which loss function?

Loss functions: squared error, logistic loss Hard
A. Hinge Loss
B. Mean Absolute Error (MAE)
C. Log-Cosh Loss
D. Mean Squared Error (MSE)

51 Given $n$ i.i.d. samples $x_1, \dots, x_n$ from an Exponential distribution with PDF $f(x; \lambda) = \lambda e^{-\lambda x}$ for $x \geq 0$, what is the Maximum Likelihood Estimate (MLE) for the rate parameter $\lambda$?

Likelihood estimation in ML context Hard
A. The sample variance
B. The reciprocal of the sample mean, $1/\bar{x}$
C. The reciprocal of the sample variance
D. The sample mean, $\bar{x}$
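
For question 51, the log-likelihood of an Exponential model with rate $\lambda$ is $n \log \lambda - \lambda \sum_i x_i$. The sketch below evaluates it on a small made-up sample (the values are hypothetical) and compares a few candidate rates.

```python
import math

samples = [0.5, 1.0, 1.5, 2.0, 5.0]  # hypothetical observations
n = len(samples)
total = sum(samples)

def log_likelihood(rate: float) -> float:
    """Exponential log-likelihood: n * log(rate) - rate * sum(x)."""
    return n * math.log(rate) - rate * total

# Setting the derivative n / rate - sum(x) to zero gives rate = n / sum(x).
rate_mle = n / total

for candidate in (0.4, rate_mle, 0.6):
    print(candidate, log_likelihood(candidate))
```

The candidate obtained from the derivative condition gives the largest log-likelihood of the three, and it can be compared directly against the sample mean of the data.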

52 The Maximum A Posteriori (MAP) estimate incorporates a prior belief about the parameters, while the Maximum Likelihood Estimate (MLE) does not. Under which specific condition on the prior distribution do the MAP and MLE estimates become identical?

Bayesian interpretation of learning models Hard
A. When the posterior distribution is Gaussian.
B. When the likelihood function is from the exponential family.
C. When the prior is a uniform distribution over the parameter space.
D. When the dataset size approaches infinity, for any valid prior.

53 A 2D random variable follows a multivariate Gaussian distribution with covariance matrix . Which statement about the relationship between the random variables and is correct?

Probability distributions used in learning algorithms Hard
A. The marginal distribution of is not Gaussian.
B. and are positively correlated and are not independent.
C. and are independent because the distribution is Gaussian.
D. and are negatively correlated.

54 The bias of an estimator for a true function is defined as the difference between the estimator's expected prediction and the true value, where the expectation is over all possible training datasets. What is the direct implication of an estimator being 'unbiased'?

Random variables in ML models Hard
A. The average of the estimator's predictions over all possible training datasets is equal to the true value.
B. The estimator is guaranteed to have the lowest possible Mean Squared Error.
C. The estimator has zero variance.
D. The estimator's prediction for any single training dataset is equal to the true value.

55 Minimizing the cross-entropy loss $H(p, q)$ is a standard approach in classification. This is often described as being equivalent to minimizing the KL divergence $D_{KL}(p \,\|\, q)$, where $p$ is the true data distribution and $q$ is the model distribution. Why is this equivalence valid in the context of training a machine learning model?

Loss functions: squared error, logistic loss Hard
A. Because KL divergence is simply another name for cross-entropy.
B. The equivalence only holds if the model output is a Gaussian distribution.
C. Because the entropy of the true data distribution, $H(p)$, is a constant with respect to the model's parameters that determine $q$.
D. Because cross-entropy is symmetric, so $H(p, q) = H(q, p)$, just like KL divergence.
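
The relationship between cross-entropy, entropy, and KL divergence in question 55 can be checked numerically. This sketch uses two made-up discrete distributions, with `p` standing in for the true distribution and `q` for the model; the values are illustrative only.

```python
import math

p = [0.7, 0.2, 0.1]  # hypothetical "true" distribution
q = [0.5, 0.3, 0.2]  # hypothetical model distribution

def entropy(p):
    return -sum(pi * math.log(pi) for pi in p)

def cross_entropy(p, q):
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q))

def kl_divergence(p, q):
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

# Identity being checked: H(p, q) = H(p) + KL(p || q).
lhs = cross_entropy(p, q)
rhs = entropy(p) + kl_divergence(p, q)
print(lhs, rhs)

# Cross-entropy is generally not symmetric in its arguments.
print(cross_entropy(p, q), cross_entropy(q, p))
```

Only the KL term depends on `q`, so changing the model moves the cross-entropy and the KL divergence by exactly the same amount.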

56 Early stopping is a regularization technique where training of an iterative algorithm (like a neural network) is halted when validation performance stops improving. What is the common Bayesian interpretation of this procedure?

Bayesian interpretation of learning models Hard
A. It is equivalent to using a Laplace (L1) prior for inducing sparsity in the weights.
B. It has no valid Bayesian interpretation and is considered a purely heuristic method.
C. It is an approximation of performing MAP estimation with a Gaussian prior on the weights, where the prior's variance is implicitly controlled by the stopping time.
D. It corresponds to maximizing the marginal likelihood (model evidence) instead of the posterior.

57 The invariance property of Maximum Likelihood Estimators (MLEs) states that if $\hat{\theta}$ is the MLE for a parameter $\theta$, and $g$ is a function, then the MLE for the transformed parameter $g(\theta)$ is simply $g(\hat{\theta})$. Which of the following is a key condition for this property to hold?

Likelihood estimation in ML context Hard
A. The property only holds if the function is bijective (one-to-one and onto).
B. The property holds for any function; it is not restricted to one-to-one functions.
C. The property only holds if the function is linear.
D. The property only holds if the likelihood belongs to the exponential family.

58 A Poisson process describes the number of events occurring in a fixed time interval, with an average rate of events per unit time. What is the probability distribution that models the waiting time between consecutive events in this process?

Probability distributions used in learning algorithms Hard
A. Exponential distribution with rate parameter .
B. Gamma distribution with shape parameter 2.
C. Poisson distribution with mean .
D. Normal distribution with mean .

59 The Fisher Information Matrix, $I(\theta)$, quantifies the information a random variable carries about an unknown parameter $\theta$. It has a deep connection to the geometry of the likelihood surface. What is its relationship to the Hessian matrix, $H$, of the negative log-likelihood function?

Loss functions: squared error, logistic loss Hard
A. The Fisher Information is the expectation of the Hessian of the negative log-likelihood.
B. The Fisher Information is the determinant of the Hessian.
C. The Fisher Information is the inverse of the Hessian.
D. The Fisher Information is always the identity matrix when the Hessian is positive definite.

60 In the evidence framework for Bayesian model selection, one maximizes the marginal likelihood $p(D \mid M)$ to choose a model $M$. How does this process naturally implement Occam's Razor?

Bayesian interpretation of learning models Hard
A. It sets the prior probability of complex models to be exponentially lower.
B. It forces the posterior distribution to be unimodal, favoring simpler explanations.
C. It penalizes overly complex models because they must spread their predictive probability over a larger space of possible datasets, reducing the probability assigned to the observed data.
D. It integrates out nuisance parameters, which is equivalent to L0 regularization.