Unit 3 - Practice Quiz

INT255 60 Questions

1 In machine learning, what does a random variable represent?

Random variables in ML models Easy
A. A variable whose value is a numerical outcome of a random phenomenon.
B. The name of the machine learning algorithm.
C. A fixed hyperparameter of a model, like the learning rate.
D. A variable that is always zero.

2 In a dataset for classifying handwritten digits (0-9), the digit's label is an example of a...

Random variables in ML models Easy
A. Negative random variable
B. Discrete random variable
C. Continuous random variable
D. Constant variable

3 Which of the following is a clear example of a continuous random variable in a machine learning model?

Random variables in ML models Easy
A. The number of 'likes' on a social media post.
B. Whether an email is 'spam' or 'not spam'.
C. The star rating (1 to 5) of a product.
D. The temperature in Celsius predicted by a weather model.

4 A random variable provides a numerical summary of a...

Random variables in ML models Easy
A. Dataset's size
B. Fixed algorithm
C. Model's complexity
D. Random outcome

5 Which probability distribution is commonly used to model binary outcomes, such as whether a customer will click on an ad or not?

Probability distributions used in learning algorithms Easy
A. Poisson distribution
B. Gaussian (Normal) distribution
C. Bernoulli distribution
D. Exponential distribution
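A quick simulation of the idea in this question: a Bernoulli(p) variable is 1 with probability p and 0 otherwise, which is exactly the click / no-click setting. The value p = 0.3 and the sample size are arbitrary illustrative choices.

```python
import random

# Bernoulli(p): each trial is 1 with probability p, else 0 -- a natural
# model for click / no-click outcomes. p = 0.3 is an illustrative value.
random.seed(42)
p = 0.3
clicks = [1 if random.random() < p else 0 for _ in range(10_000)]
observed_rate = sum(clicks) / len(clicks)  # should land near p
```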

6 The Gaussian distribution, often called the 'bell curve', is primarily defined by which two parameters?

Probability distributions used in learning algorithms Easy
A. Mean and variance
B. Minimum and maximum
C. Rate and time
D. Number of trials and probability of success

7 If you are modeling the number of typographical errors on a page of a book, which discrete probability distribution is most appropriate?

Probability distributions used in learning algorithms Easy
A. Gaussian distribution
B. Uniform distribution
C. Bernoulli distribution
D. Poisson distribution

8 What does a probability distribution fundamentally describe?

Probability distributions used in learning algorithms Easy
A. The exact value a variable will take.
B. The total number of data points in a set.
C. The probability of each possible outcome of a random variable.
D. The speed of a learning algorithm.

9 What is the primary goal of the Maximum Likelihood Estimation (MLE) method?

Likelihood estimation in ML context Easy
A. To prove that a model is 100% correct.
B. To calculate the prior probability of the parameters.
C. To minimize the number of features in the dataset.
D. To find the model parameters that maximize the probability of observing the given data.

10 The likelihood function is the probability of the observed data treated as a function of the...

Likelihood estimation in ML context Easy
A. Prior probability
B. Parameters
C. Data
D. Loss function

11 Why is the log-likelihood often maximized instead of the likelihood itself?

Likelihood estimation in ML context Easy
A. It is the only way to handle data with negative values.
B. It is a requirement for all regression models.
C. It gives a completely different and better result.
D. It is mathematically more convenient, turning products into sums.

12 In a linear regression model, assuming the errors are normally distributed, MLE is equivalent to minimizing which loss function?

Likelihood estimation in ML context Easy
A. Absolute Error Loss
B. Logistic Loss
C. Hinge Loss
D. Squared Error Loss
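A numeric sketch of why this equivalence holds: with Gaussian errors, the negative log-likelihood equals the sum of squared errors up to quantities that do not depend on the predictions, so maximizing the likelihood and minimizing squared error select the same parameters. The y values below are made-up illustrative numbers.

```python
import math

# Gaussian NLL vs. sum of squared errors (SSE), with sigma = 1.
y_true = [1.0, 2.0, 3.5]
y_pred = [1.1, 1.8, 3.9]
sigma = 1.0

sse = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
nll = sum(
    0.5 * math.log(2 * math.pi * sigma**2) + (t - p) ** 2 / (2 * sigma**2)
    for t, p in zip(y_true, y_pred)
)
# NLL = SSE / (2 sigma^2) + a constant that ignores the predictions
constant = 0.5 * len(y_true) * math.log(2 * math.pi * sigma**2)
```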

13 The Mean Squared Error (MSE) loss function is most suitable for which type of machine learning problem?

Loss functions: squared error, logistic loss Easy
A. Reinforcement Learning
B. Clustering
C. Classification
D. Regression

14 If a model predicts a house price of $300,000 and the actual price is $310,000, what is the squared error for this single prediction?

Loss functions: squared error, logistic loss Easy
A. $300,000
B.
C. $10,000
D. $100,000,000
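A one-line check of the arithmetic, using the prices from the question stem:

```python
# Squared error for a single prediction: the residual is $10,000,
# and squaring it gives 100,000,000.
actual = 310_000
predicted = 300_000
squared_error = (actual - predicted) ** 2
```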

15 Logistic Loss, also known as Binary Cross-Entropy, is primarily used in which type of task?

Loss functions: squared error, logistic loss Easy
A. Linear Regression
B. Anomaly Detection
C. Time Series Forecasting
D. Binary Classification

16 What is the fundamental role of a loss function during the training of a machine learning model?

Loss functions: squared error, logistic loss Easy
A. To quantify the model's error, which the training process tries to minimize.
B. To count the number of data points in the training set.
C. To select which algorithm to use.
D. To preprocess and clean the input data.

17 In the Bayesian formula P(θ|D) = P(D|θ) P(θ) / P(D), what does the term ___ represent?

Bayesian interpretation of learning models Easy
A. The Prior: our belief about the parameters before seeing data, P(θ).
B. The Evidence: the probability of the data, P(D).
C. The Likelihood: the probability of the data given the parameters, P(D|θ).
D. The Posterior: our belief about the parameters after seeing data, P(θ|D).

18 What is the 'posterior' probability in the context of Bayesian learning?

Bayesian interpretation of learning models Easy
A. The initial guess about a hypothesis before seeing any data.
B. The updated probability of a hypothesis after considering new evidence.
C. The probability that the evidence is correct.
D. A type of loss function.

19 A key difference between the Bayesian and frequentist approaches is that the Bayesian approach treats model parameters as...

Bayesian interpretation of learning models Easy
A. Unnecessary for prediction
B. Always positive numbers
C. Random variables
D. Fixed constants

20 What does Maximum a Posteriori (MAP) estimation do?

Bayesian interpretation of learning models Easy
A. It finds the parameter values that maximize only the likelihood.
B. It finds the parameter values that maximize the posterior probability.
C. It calculates the mean of the posterior distribution.
D. It finds the parameter values that maximize only the prior.

21 In a standard linear regression model y = Xβ + ε, which components are treated as random variables from a modeling perspective?

Random variables in ML models Medium
A. Only the target variable y
B. The feature matrix X and the parameters β
C. The target variable y and the error term ε
D. Only the parameters β

22 Which probability distribution is most suitable for modeling the class labels in a multi-class classification problem with K classes for a single data point?

Probability distributions used in learning algorithms Medium
A. Categorical distribution
B. Poisson distribution
C. Binomial distribution
D. Gaussian distribution

23 Assuming a dataset of N i.i.d. data points x_1, ..., x_N and a model that gives p(x_i | θ), the likelihood function L(θ) is defined as:

Likelihood estimation in ML context Medium
A. L(θ) = p(x_1, ..., x_N | θ) without the i.i.d. assumption
B. L(θ) = ∏_i p(x_i | θ)
C. L(θ) = ∑_i p(x_i | θ)
D. L(θ) = ∑_i log p(x_i | θ)

24 A binary classifier predicts a probability of 0.8 for a data point whose true label is 1. What is the logistic loss (cross-entropy) for this single prediction?

Loss functions: squared error, logistic loss Medium
A. −log(0.8) ≈ 0.22
B. −log(0.2) ≈ 1.61
C. log(0.8) ≈ −0.22
D. 0.2
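A minimal sketch of the logistic-loss arithmetic. The predicted probability 0.8 and true label 1 are illustrative assumptions, not values taken from an answer key; with them, the loss collapses to −log(0.8).

```python
import math

# Binary cross-entropy for one example:
#   loss = -(y * log(p) + (1 - y) * log(1 - p))
# For y = 1 this reduces to -log(p).
p, y = 0.8, 1
loss = -(y * math.log(p) + (1 - y) * math.log(1 - p))
```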

25 In a Bayesian framework, what is the relationship between the posterior, prior, and likelihood?

Bayesian interpretation of learning models Medium
A. Prior ∝ Likelihood × Posterior
B. Posterior = Likelihood + Prior
C. Likelihood ∝ Posterior × Prior
D. Posterior ∝ Likelihood × Prior

26 You are building a model to predict the number of customer support emails your company will receive in a one-hour period. Which probability distribution is a common and suitable choice for this task?

Probability distributions used in learning algorithms Medium
A. Gaussian distribution
B. Poisson distribution
C. Uniform distribution
D. Bernoulli distribution

27 Why is squared error loss generally not a good choice for classification problems?

Loss functions: squared error, logistic loss Medium
A. Its derivative is always zero for classification.
B. It cannot be optimized using gradient descent.
C. It assumes the target variable is continuous.
D. It penalizes confident, incorrect predictions too little.

28 Maximizing the log-likelihood is equivalent to maximizing the likelihood itself. What is the primary practical advantage of optimizing the log-likelihood?

Likelihood estimation in ML context Medium
A. It converts products into sums, which are numerically more stable and analytically simpler.
B. It always results in a convex optimization problem.
C. It incorporates a prior belief about the parameters.
D. It is computationally faster to compute a single logarithm than a single product.
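A small sketch of the numerical point behind this question: a product of many small probabilities underflows double precision to 0.0, while the equivalent sum of logs remains comfortably representable. The 200 identical density values are an illustrative assumption.

```python
import math

# Product of many small densities underflows; sum of logs does not.
probs = [1e-4] * 200          # 200 i.i.d. points, each with density 1e-4

likelihood = 1.0
for p in probs:
    likelihood *= p           # underflows to exactly 0.0

log_likelihood = sum(math.log(p) for p in probs)   # = 200 * log(1e-4)
```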

29 When we say a machine learning model provides a 'probabilistic prediction', what does this imply about its output?

Random variables in ML models Medium
A. The model uses a random number generator to make predictions.
B. The model's output is guaranteed to be correct with a certain probability.
C. The model's output is a probability distribution over possible outcomes, not just a single point estimate.
D. The model's parameters are updated randomly during training.

30 What is the primary conceptual difference between Maximum a Posteriori (MAP) and Maximum Likelihood Estimation (MLE)?

Bayesian interpretation of learning models Medium
A. MAP incorporates a prior belief about the model parameters, while MLE does not.
B. MAP is a Bayesian method, while MLE is a frequentist method that cannot be interpreted probabilistically.
C. MLE is used for regression while MAP is used for classification.
D. MAP maximizes the probability of the data given the parameters, while MLE maximizes the posterior.

31 If we assume that the target variable y_i in a regression problem follows a Gaussian distribution with mean ŷ_i (the model's prediction) and constant variance σ², maximizing the likelihood of the model parameters is equivalent to minimizing which loss function?

Likelihood estimation in ML context Medium
A. Sum of Absolute Errors: ∑_i |y_i − ŷ_i|
B. Sum of Squared Errors: ∑_i (y_i − ŷ_i)²
C. Logistic Loss
D. Hinge Loss

32 In logistic regression, the sigmoid function σ(z) = 1 / (1 + e^(−z)) is used to model the probability of the positive class. This probability is the parameter of which underlying probability distribution for the binary target variable?

Probability distributions used in learning algorithms Medium
A. Bernoulli distribution
B. Binomial distribution
C. Categorical distribution
D. Gaussian distribution

33 A regression model predicts a value of 150 for a data point with a true value of 100. Another model predicts 200 for a true value of 150. How do their squared error losses compare?

Loss functions: squared error, logistic loss Medium
A. The loss is the same for both predictions.
B. The second prediction has a higher loss.
C. The comparison is impossible without knowing the model.
D. The first prediction has a higher loss.
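A direct check of the comparison in this question, using the values from the stem:

```python
# Both predictions miss by 50, so their squared errors are identical.
loss_first = (150 - 100) ** 2   # prediction 150, true value 100
loss_second = (200 - 150) ** 2  # prediction 200, true value 150
```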

34 Performing MAP estimation with a Gaussian prior on the model weights is equivalent to performing MLE with which type of regularization?

Bayesian interpretation of learning models Medium
A. L2 Regularization (Ridge)
B. No regularization
C. L1 Regularization (Lasso)
D. Dropout
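A numeric sketch of the MAP-to-ridge connection: the negative log of a zero-mean Gaussian prior on a weight w is λ·w² plus a constant, with λ = 1/(2τ²), i.e. exactly an L2 penalty. The values of w and τ below are arbitrary illustrative choices.

```python
import math

# -log N(w; 0, tau^2) = lam * w**2 + constant, with lam = 1 / (2 * tau**2).
w, tau = 0.7, 2.0
neg_log_prior = 0.5 * math.log(2 * math.pi * tau**2) + w**2 / (2 * tau**2)
lam = 1 / (2 * tau**2)
constant = 0.5 * math.log(2 * math.pi * tau**2)
```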

35 In a classification model, the output for a given input is a vector of probabilities for classes A, B, and C. This vector can be interpreted as the parameters of which random variable?

Random variables in ML models Medium
A. A set of three independent Bernoulli random variables
B. A Gaussian random variable representing the prediction error
C. A Binomial random variable representing the count of correct predictions
D. A Categorical random variable representing the predicted class

36 You have a biased coin where the probability of heads, p, is unknown. You flip it 5 times and observe the sequence H, T, H, H, T. What is the Maximum Likelihood Estimate (MLE) for p?

Likelihood estimation in ML context Medium
A. It cannot be determined from this data.
B. 0.6
C. 0.5
D. 0.4
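A one-line verification of the MLE for the coin, using the flip sequence given in the stem: the maximum-likelihood estimate of a Bernoulli parameter is simply the empirical frequency of heads.

```python
# MLE for a Bernoulli parameter = empirical frequency of successes.
flips = ["H", "T", "H", "H", "T"]
p_hat = flips.count("H") / len(flips)   # 3 heads in 5 flips
```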

37 Consider two regression models. Model A has a Root Mean Squared Error (RMSE) of 10. Model B has a Mean Absolute Error (MAE) of 10. What can we definitively conclude?

Loss functions: squared error, logistic loss Medium
A. We cannot directly compare the models as they use different error metrics.
B. Model A is better than Model B.
C. Both models have the same predictive accuracy.
D. Model B is better than Model A.

38 A key assumption of the Naive Bayes algorithm is that the features are conditionally independent given the class label. How does this assumption simplify the model's probability calculations?

Probability distributions used in learning algorithms Medium
A. It allows the joint probability of features to be calculated as the product of individual probabilities: P(x_1, ..., x_n | y) = ∏_i P(x_i | y).
B. It eliminates the need to calculate the likelihood.
C. It forces all features to follow a Gaussian distribution.
D. It ensures that the posterior probability is always greater than the prior.
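A tiny sketch of the factorization that the naive assumption buys. The feature names and the per-feature probabilities here are made up for illustration.

```python
# Under conditional independence, the class-conditional joint probability
# of the features is just a product of per-feature terms.
p_feature_given_spam = {"contains_offer": 0.8, "has_link": 0.6, "all_caps": 0.3}

joint = 1.0
for p in p_feature_given_spam.values():
    joint *= p   # P(x1, x2, x3 | spam) = 0.8 * 0.6 * 0.3
```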

39 Under what condition does the Maximum a Posteriori (MAP) estimate for a parameter become exactly the same as the Maximum Likelihood Estimate (MLE)?

Bayesian interpretation of learning models Medium
A. When the prior distribution is a uniform distribution.
B. When the likelihood function is Gaussian.
C. When the dataset is very small.
D. When the posterior distribution is symmetric.

40 The logistic loss function for binary classification is derived from which principle?

Loss functions: squared error, logistic loss Medium
A. Minimizing the absolute error of the predicted probabilities.
B. Minimizing the squared distance between the prediction and the true label.
C. Finding the maximum margin hyperplane between classes.
D. Maximizing the likelihood of the data under a Bernoulli distribution assumption.

41 In Maximum Likelihood Estimation (MLE), finding parameters that maximize the likelihood is equivalent to minimizing a specific Kullback-Leibler (KL) divergence. Given the empirical data distribution p̂_data and the model distribution p_θ, which KL divergence is minimized?

Likelihood estimation in ML context Hard
A. D_KL(p̂_data ‖ p_θ)
B. D_KL(p_θ ‖ p_prior), where p_prior is a standard normal prior
C. D_KL(p_θ ‖ p̂_data)
D. The symmetrized KL Divergence: D_KL(p̂_data ‖ p_θ) + D_KL(p_θ ‖ p̂_data)

42 Consider a binary classification problem using a sigmoid activation function σ(z) = 1 / (1 + e^(−z)). While squared error loss, L = (y − σ(z))², can be used, logistic loss (binary cross-entropy) is strongly preferred. From an optimization standpoint, what is the primary deficiency of using squared error loss in this context?

Loss functions: squared error, logistic loss Hard
A. It is an unbounded loss function, unlike logistic loss.
B. Its corresponding loss surface is non-convex, potentially having multiple local minima.
C. It is not differentiable with respect to the model weights.
D. It produces vanishingly small gradients for confidently misclassified samples, slowing learning.

43 In Maximum A Posteriori (MAP) estimation, the choice of prior distribution over the weights corresponds to a specific type of regularization. If the prior is a zero-mean Laplace distribution, p(w_j) ∝ exp(−|w_j| / b), what form of regularization does this induce when minimizing the negative log-posterior?

Bayesian interpretation of learning models Hard
A. L2 Regularization (Ridge)
B. L1 Regularization (Lasso)
C. Dropout
D. Elastic Net Regularization

44 In topic modeling with Latent Dirichlet Allocation (LDA), a symmetric Dirichlet prior, Dir(α), is placed on the per-document topic distributions. How does the hyperparameter α influence the characteristics of the topic mixtures learned for documents?

Probability distributions used in learning algorithms Hard
A. Values of α < 1 encourage dense, uniform-like mixtures, while α > 1 encourages sparse topic mixtures.
B. α controls the number of topics in the model.
C. α only controls the variance of the topic distribution, not its sparsity.
D. Values of α < 1 encourage sparse topic mixtures (few topics per document), while α > 1 encourages dense, uniform-like mixtures.

45 In a Naive Bayes classifier, the 'naive' assumption concerns the conditional independence of feature random variables X_1, ..., X_n given the class random variable Y. Which mathematical statement correctly represents this core assumption?

Random variables in ML models Hard
A. P(X_1, ..., X_n) = ∏_i P(X_i)
B. P(X_1, ..., X_n | Y) = ∏_i P(X_i | Y)
C. P(Y | X_1, ..., X_n) = ∏_i P(Y | X_i)
D. P(X_1, ..., X_n | Y) = ∑_i P(X_i | Y)

46 Suppose your model is misspecified, e.g., you use MLE to fit a Gaussian model p_θ to data that was truly generated from a Laplace distribution p_true. In the limit of infinite data, the MLE parameters will converge to the parameters of the Gaussian that is 'closest' to the true Laplace distribution. What measure of 'closeness' does MLE implicitly minimize?

Likelihood estimation in ML context Hard
A. The KL divergence D_KL(p_true ‖ p_θ)
B. The Total Variation distance
C. The KL divergence D_KL(p_θ ‖ p_true)
D. The L2 distance between the probability density functions

47 Comparing Huber loss, Logistic loss, and Squared Error loss in a classification context, how would you rank their robustness to outliers, from most robust to least robust? An outlier is a point with a large error, e.g., a mislabeled point far from the boundary.

Loss functions: squared error, logistic loss Hard
A. All three have similar robustness as they are all convex.
B. Squared Error Loss > Logistic Loss > Huber Loss
C. Huber Loss > Logistic Loss > Squared Error Loss
D. Logistic Loss > Huber Loss > Squared Error Loss

48 In Bayesian modeling, a prior distribution is 'conjugate' to a likelihood if the resulting posterior is in the same probability distribution family as the prior. What is the primary computational advantage of using a conjugate prior?

Bayesian interpretation of learning models Hard
A. It yields a closed-form analytical expression for the posterior, avoiding the need for numerical approximation methods like MCMC.
B. It guarantees that the MAP estimate will be identical to the MLE.
C. It ensures the posterior distribution is symmetric and unimodal.
D. It simplifies the calculation of model gradients for optimization.

49 The reparameterization trick is essential for training Variational Autoencoders (VAEs). For a latent variable z modeled by a diagonal Gaussian q(z|x) = N(z; μ, diag(σ²)), how does this trick allow gradients to flow through the sampling step?

Probability distributions used in learning algorithms Hard
A. By using the score function estimator (REINFORCE) to estimate the gradient of the stochastic node.
B. By analytically integrating out the random variable from the loss function.
C. By replacing the sampling step with the mode of the distribution, μ.
D. By expressing z as a deterministic function of the parameters and a parameter-free random variable: z = μ + σ ⊙ ε, where ε ~ N(0, I).
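A one-dimensional sketch of the trick: sample parameter-free noise ε once, then express z as a deterministic function of the distribution parameters. Shifting μ shifts z by exactly the same amount, which is what makes gradients with respect to the parameters well-defined. The values of μ and σ are arbitrary illustrative choices.

```python
import random

# Reparameterization: z = mu + sigma * eps, with eps ~ N(0, 1).
random.seed(0)
mu, sigma = 3.0, 0.5
eps = random.gauss(0.0, 1.0)

z = mu + sigma * eps
# Reusing the same eps, shifting mu by 1 shifts z by exactly 1 -- the
# dependence on the parameters is an ordinary deterministic map.
z_shifted = (mu + 1.0) + sigma * eps
```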

50 Assume a linear regression model where the target variable is modeled as a deterministic function of the inputs plus Gaussian noise, y = wᵀx + ε, with ε ~ N(0, σ²). Maximizing the likelihood of this model with respect to the weights w is equivalent to minimizing which loss function?

Loss functions: squared error, logistic loss Hard
A. Hinge Loss
B. Log-Cosh Loss
C. Mean Squared Error (MSE)
D. Mean Absolute Error (MAE)

51 Given N i.i.d. samples x_1, ..., x_N from an Exponential distribution with PDF f(x; λ) = λe^(−λx) for x ≥ 0, what is the Maximum Likelihood Estimate (MLE) for the rate parameter λ?

Likelihood estimation in ML context Hard
A. The reciprocal of the sample variance
B. The sample mean, x̄
C. The reciprocal of the sample mean, 1/x̄
D. The sample variance
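A numerical sanity check that the exponential log-likelihood peaks at λ = 1 / x̄: the candidate equal to the reciprocal of the sample mean beats nearby perturbed candidates. The sample values are made up for illustration.

```python
import math

# Exponential log-likelihood: log prod(lam * exp(-lam * x))
#                           = sum(log(lam) - lam * x).
samples = [0.5, 1.2, 0.3, 2.0, 1.0]
sample_mean = sum(samples) / len(samples)
lambda_mle = 1 / sample_mean

def log_likelihood(lam):
    return sum(math.log(lam) - lam * x for x in samples)

candidates = [lambda_mle * f for f in (0.9, 0.99, 1.0, 1.01, 1.1)]
best = max(candidates, key=log_likelihood)
```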

52 The Maximum A Posteriori (MAP) estimate incorporates a prior belief about the parameters, while the Maximum Likelihood Estimate (MLE) does not. Under which specific condition on the prior distribution do the MAP and MLE estimates become identical?

Bayesian interpretation of learning models Hard
A. When the dataset size approaches infinity, for any valid prior.
B. When the prior is a uniform distribution over the parameter space.
C. When the posterior distribution is Gaussian.
D. When the likelihood function is from the exponential family.

53 A 2D random variable (X_1, X_2) follows a multivariate Gaussian distribution with covariance matrix Σ. Which statement about the relationship between the random variables X_1 and X_2 is correct?

Probability distributions used in learning algorithms Hard
A. X_1 and X_2 are positively correlated and are not independent.
B. X_1 and X_2 are independent because the distribution is Gaussian.
C. The marginal distribution of X_1 is not Gaussian.
D. X_1 and X_2 are negatively correlated.

54 The bias of an estimator f̂(x) for a true function f(x) is defined as Bias = E_D[f̂(x)] − f(x), where the expectation is over all possible training datasets D. What is the direct implication of an estimator being 'unbiased'?

Random variables in ML models Hard
A. The estimator's prediction for any single training dataset is equal to the true value.
B. The average of the estimator's predictions over all possible training datasets is equal to the true value.
C. The estimator is guaranteed to have the lowest possible Mean Squared Error.
D. The estimator has zero variance.

55 Minimizing the cross-entropy loss is a standard approach in classification. This is often described as being equivalent to minimizing the KL divergence D_KL(p_data ‖ p_model). Why is this equivalence valid in the context of training a machine learning model?

Loss functions: squared error, logistic loss Hard
A. The equivalence only holds if the model output is a Gaussian distribution.
B. Because KL divergence is simply another name for cross-entropy.
C. Because the entropy of the true data distribution, H(p_data), is a constant with respect to the model's parameters that determine p_model.
D. Because cross-entropy is symmetric, so H(p, q) = H(q, p), just like KL divergence.
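A numeric check of the decomposition behind this question: H(p, q) = H(p) + KL(p ‖ q), where H(p) does not involve the model, so minimizing cross-entropy and minimizing the KL divergence coincide. The two distributions below are arbitrary illustrative choices.

```python
import math

# Cross-entropy decomposes as entropy of p plus KL(p || q).
p = [0.7, 0.2, 0.1]   # "true" distribution
q = [0.6, 0.3, 0.1]   # model distribution

cross_entropy = -sum(pi * math.log(qi) for pi, qi in zip(p, q))
entropy_p = -sum(pi * math.log(pi) for pi in p)
kl_pq = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))
```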

56 Early stopping is a regularization technique where training of an iterative algorithm (like a neural network) is halted when validation performance stops improving. What is the common Bayesian interpretation of this procedure?

Bayesian interpretation of learning models Hard
A. It is an approximation of performing MAP estimation with a Gaussian prior on the weights, where the prior's variance is implicitly controlled by the stopping time.
B. It corresponds to maximizing the marginal likelihood (model evidence) instead of the posterior.
C. It has no valid Bayesian interpretation and is considered a purely heuristic method.
D. It is equivalent to using a Laplace (L1) prior for inducing sparsity in the weights.

57 The invariance property of Maximum Likelihood Estimators (MLEs) states that if θ̂ is the MLE for a parameter θ, and g is a function, then the MLE for the transformed parameter g(θ) is simply g(θ̂). Which of the following is a key condition for this property to hold?

Likelihood estimation in ML context Hard
A. The property only holds if the likelihood belongs to the exponential family.
B. The property holds for any function g; it is not restricted to being one-to-one.
C. The property only holds if the function g is linear.
D. The property only holds if the function g is bijective (one-to-one and onto).

58 A Poisson process describes the number of events occurring in a fixed time interval, with an average rate of λ events per unit time. What is the probability distribution that models the waiting time between consecutive events in this process?

Probability distributions used in learning algorithms Hard
A. Poisson distribution with mean λ.
B. Exponential distribution with rate parameter λ.
C. Normal distribution with mean λ.
D. Gamma distribution with shape parameter 2.

59 The Fisher Information Matrix, I(θ), quantifies the information a random variable carries about an unknown parameter θ. It has a deep connection to the geometry of the likelihood surface. What is its relationship to the Hessian matrix, H, of the negative log-likelihood function?

Loss functions: squared error, logistic loss Hard
A. The Fisher Information is the expectation of the Hessian of the negative log-likelihood.
B. The Fisher Information is the determinant of the Hessian.
C. The Fisher Information is always the identity matrix when the Hessian is positive definite.
D. The Fisher Information is the inverse of the Hessian.

60 In the evidence framework for Bayesian model selection, one maximizes the marginal likelihood p(D | M) to choose a model M. How does this process naturally implement Occam's Razor?

Bayesian interpretation of learning models Hard
A. It forces the posterior distribution to be unimodal, favoring simpler explanations.
B. It penalizes overly complex models because they must spread their predictive probability over a larger space of possible datasets, reducing the probability assigned to the observed data.
C. It integrates out nuisance parameters, which is equivalent to L0 regularization.
D. It sets the prior probability of complex models, p(M), to be exponentially lower.