1. In the context of supervised learning, what is the primary goal of Classification?
A.To predict a continuous numerical value based on input features.
B.To group similar data points together without predefined labels.
C.To map input variables to discrete output categories or classes.
D.To reduce the dimensionality of the dataset.
Correct Answer: To map input variables to discrete output categories or classes.
Explanation:Classification involves identifying which of a set of categories (sub-populations) a new observation belongs to, on the basis of a training set of data containing observations (or instances) whose category membership is known. This distinguishes it from regression, which predicts continuous values.
2. What does a Decision Boundary represent in a classification problem?
A.The limit of the computational power required to train the model.
B.A hypersurface that partitions the underlying vector space into two or more sets, one for each class.
C.The boundary where the training data ends and the testing data begins.
D.The maximum error rate acceptable for the model.
Correct Answer: A hypersurface that partitions the underlying vector space into two or more sets, one for each class.
Explanation:A decision boundary is the hypersurface (line in 2D, plane in 3D, etc.) that separates different classes in the feature space. On one side of the boundary, data points are classified as one class, and on the other side, a different class.
3. Consider a linear classifier in a 2-dimensional feature space defined by $w_1 x_1 + w_2 x_2 + b = 0$. What geometric shape is the decision boundary?
A.A parabola
B.A circle
C.A straight line
D.A hyperbola
Correct Answer: A straight line
Explanation:In 2D space, a linear equation of the form $w_1 x_1 + w_2 x_2 + b = 0$ represents a straight line. In 3D it is a plane, and in higher dimensions, it is a hyperplane.
4. For a Linear Classifier, the decision rule is often given by $f(\mathbf{x}) = \operatorname{sign}(\mathbf{w}^\top \mathbf{x} + b)$. What is the role of the bias term $b$?
A.It rotates the decision boundary around the origin.
B.It scales the length of the weight vector $\mathbf{w}$.
C.It translates the decision boundary away from the origin.
D.It creates non-linear curves in the boundary.
Correct Answer: It translates the decision boundary away from the origin.
Explanation:The weight vector $\mathbf{w}$ determines the orientation of the decision boundary, while the bias term $b$ allows the decision boundary to be offset (translated) from the origin. Without $b$, the boundary would always pass through the origin.
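As a small worked illustration (with arbitrary example values, not taken from the question): fix $\mathbf{w} = (1, 1)$ in 2D.
$$b = 0:\;\; x_1 + x_2 = 0 \;\text{ (boundary through the origin)}, \qquad b = -2:\;\; x_1 + x_2 = 2 \;\text{ (same orientation, shifted away from the origin)}.$$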
5. In the One-vs-All (One-vs-Rest) strategy for multi-class classification with $K$ classes, how many binary classifiers are trained?
A.
B.
C.
D.
Correct Answer: $K$
Explanation:In One-vs-All, we train $K$ separate binary classifiers. The $k$-th classifier distinguishes class $k$ from all other classes combined.
6. In the One-vs-One strategy for multi-class classification with $K$ classes, how many binary classifiers are trained?
A.
B.
C.
D.
Correct Answer: $\frac{K(K-1)}{2}$
Explanation:One-vs-One builds a classifier for every pair of classes. The number of unique pairs among $K$ classes is given by the combination formula $\binom{K}{2} = \frac{K(K-1)}{2}$.
7. When using the One-vs-One strategy, how is the final classification decision typically made for a new data point?
A.By selecting the class with the highest probability from a single classifier.
B.By averaging the regression outputs of all classifiers.
C.By a voting scheme where the class with the most 'wins' is selected.
D.By choosing the class that was trained last.
Correct Answer: By a voting scheme where the class with the most 'wins' is selected.
Explanation:In OvO, each of the $\frac{K(K-1)}{2}$ classifiers casts a vote for one of the two classes it distinguishes. The class that receives the maximum number of votes is predicted.
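As a minimal sketch of the voting step (the pairwise predictions below are illustrative placeholders, not part of the quiz):

```python
from collections import Counter

# Hypothetical outputs of the K(K-1)/2 pairwise classifiers for one test point:
# each tuple is (class_a, class_b, winner_of_that_pairwise_contest).
pairwise_predictions = [
    ("cat", "dog", "cat"),
    ("cat", "bird", "cat"),
    ("dog", "bird", "dog"),
]

# Each pairwise classifier casts one vote; the class with the most wins is predicted.
votes = Counter(winner for _, _, winner in pairwise_predictions)
predicted_class, n_votes = votes.most_common(1)[0]
print(predicted_class, n_votes)  # -> cat 2
```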
8. Which of the following is a potential disadvantage of the One-vs-All strategy when classes are imbalanced?
A.It requires too many classifiers to be trained.
B.The datasets for the binary classifiers become heavily skewed (e.g., 1 vs 99 others).
C.It cannot handle linear decision boundaries.
D.It is computationally more expensive than One-vs-One during inference.
Correct Answer: The datasets for the binary classifiers become heavily skewed (e.g., 1 vs 99 others).
Explanation:In OvA, the negative class includes samples from all other classes. If there are many classes, the negative set will vastly outnumber the positive set for each class-specific classifier, leading to class-imbalance problems during training.
9. In Bayes' Theorem, given by $P(C_k \mid x) = \frac{P(x \mid C_k)\,P(C_k)}{P(x)}$, what is the term $P(x \mid C_k)$ called?
A.Posterior
B.Prior
C.Likelihood
D.Evidence
Correct Answer: Likelihood
Explanation:$P(x \mid C_k)$ is the Likelihood (or Class-Conditional Probability), representing the probability of observing feature $x$ given that the class is $C_k$.
10. In Bayes' Theorem, $P(C_k \mid x) = \frac{P(x \mid C_k)\,P(C_k)}{P(x)}$, what does $P(C_k)$ represent?
A.The probability of the data occurring regardless of the class.
B.The probability of the class before observing any data (Prior).
C.The probability of the class after observing the data (Posterior).
D.The conditional dependence of $x$ on $C_k$.
Correct Answer: The probability of the class before observing any data (Prior).
Explanation:$P(C_k)$ is the Prior probability, reflecting our initial belief about the probability of class $C_k$ occurring before seeing any specific features.
11. What is the fundamental assumption of the Naïve Bayes classifier?
A.Features are dependent on each other given the class label.
B.All features contribute equally to the decision boundary regardless of the class.
C.Features are conditionally independent given the class label.
D.The prior probabilities of all classes are equal.
Correct Answer: Features are conditionally independent given the class label.
Explanation:The 'Naïve' assumption is that the value of a particular feature is independent of the value of any other feature, given the class variable. Mathematically: $P(x_1, \dots, x_n \mid C_k) = \prod_{i=1}^{n} P(x_i \mid C_k)$.
12. Which of the following equations represents the decision rule for a Naïve Bayes classifier (ignoring the evidence $P(x)$, as it is constant for all classes)?
A.
B.
C.
D.
Correct Answer: $\hat{y} = \arg\max_{k} P(C_k) \prod_{i=1}^{n} P(x_i \mid C_k)$
Explanation:According to the MAP (Maximum A Posteriori) estimation and the independence assumption, we select the class $C_k$ that maximizes the product of the prior $P(C_k)$ and the individual likelihoods $P(x_i \mid C_k)$.
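A minimal sketch of this decision rule in code, assuming the priors and per-feature likelihoods have already been estimated (the numbers below are illustrative placeholders):

```python
import math

# Illustrative, pre-estimated parameters for two classes and two binary features.
priors = {"spam": 0.4, "ham": 0.6}
# likelihoods[c][i][v] = P(x_i = v | C = c)
likelihoods = {
    "spam": [{0: 0.2, 1: 0.8}, {0: 0.7, 1: 0.3}],
    "ham":  [{0: 0.9, 1: 0.1}, {0: 0.4, 1: 0.6}],
}

def predict(x):
    """Return argmax_k P(C_k) * prod_i P(x_i | C_k)."""
    best_class, best_score = None, -math.inf
    for c, prior in priors.items():
        score = prior
        for i, value in enumerate(x):
            score *= likelihoods[c][i][value]
        if score > best_score:
            best_class, best_score = c, score
    return best_class

print(predict([1, 0]))  # -> 'spam' with these numbers (0.224 vs 0.024)
```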
13. In a Naïve Bayes classifier, what is the Zero Frequency Problem?
A.When the prior probability of a class is zero.
B.When a feature value appears in the test set but was never observed with a specific class in the training set, resulting in a zero likelihood.
C.When the entire dataset has zero variance.
D.When the computation results in a divide-by-zero error during normalization.
Correct Answer: When a feature value appears in the test set but was never observed with a specific class in the training set, resulting in a zero likelihood.
Explanation:If a categorical variable has a category in the test data that was not observed in the training data for a specific class, the probability becomes 0. Since probabilities are multiplied, this makes the entire posterior probability 0.
14. What technique is commonly used to solve the Zero Frequency Problem in Naïve Bayes?
Correct Answer: Laplace (Add-one) Smoothing
Explanation:Laplace smoothing adds a small count (usually 1, or a constant $\alpha$) to every feature-value count in the training data to ensure that no probability estimate is ever exactly zero.
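One common form of the smoothed estimate, assuming feature $x_i$ can take $|V_i|$ distinct values, class $C_k$ has $N_k$ training samples, and $\alpha$ is the smoothing constant ($\alpha = 1$ for Add-one):
$$\hat{P}(x_i = v \mid C_k) = \frac{\operatorname{count}(x_i = v,\, C_k) + \alpha}{N_k + \alpha\,|V_i|}.$$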
15. Which variation of Naïve Bayes is most appropriate when feature values are continuous and assumed to follow a normal distribution?
A.Multinomial Naïve Bayes
B.Bernoulli Naïve Bayes
C.Gaussian Naïve Bayes
D.Poisson Naïve Bayes
Correct Answer: Gaussian Naïve Bayes
Explanation:Gaussian Naïve Bayes assumes that the continuous values associated with each class are distributed according to a Gaussian (Normal) distribution.
16. In Bayesian Decision Theory, the concept of Risk is defined as:
A.The probability of choosing the wrong class.
B.The expected loss associated with a decision rule.
C.The computational complexity of the algorithm.
D.The inverse of the likelihood function.
Correct Answer: The expected loss associated with a decision rule.
Explanation:Risk is the expected value of the loss function. Ideally, a decision rule is chosen to minimize this total risk.
17. If we use the Zero-One Loss function (loss is 0 for correct classification, 1 for incorrect), minimizing the Risk is equivalent to:
A.Minimizing the squared error.
B.Maximizing the likelihood.
C.Minimizing the probability of error.
D.Maximizing the entropy.
Correct Answer: Minimizing the probability of error.
Explanation:With Zero-One loss, the risk corresponds exactly to the probability of misclassification. Therefore, the optimal strategy (Bayes decision rule) is to select the class with the highest posterior probability to minimize error.
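The reasoning in one line: with zero-one loss, the conditional risk of deciding class $\omega_i$ is
$$R(\alpha_i \mid x) = \sum_{j \neq i} P(\omega_j \mid x) = 1 - P(\omega_i \mid x),$$
so minimizing the risk is the same as choosing the class with the largest posterior, which minimizes the probability of error.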
18. Given the formula for the Gaussian Naïve Bayes likelihood, $P(x_i \mid C_k) = \frac{1}{\sqrt{2\pi\sigma_{ik}^2}} \exp\!\left(-\frac{(x_i - \mu_{ik})^2}{2\sigma_{ik}^2}\right)$, what parameters need to be estimated from the training data?
A.The weights $\mathbf{w}$ and bias $b$.
B.The mean and variance for each class and feature.
C.The median and mode.
D.The min and max values of $x$.
Correct Answer: The mean and variance for each class and feature.
Explanation:Gaussian Naïve Bayes parametrizes the likelihood using the mean $\mu_{ik}$ and variance $\sigma_{ik}^2$ of feature $x_i$ specific to class $C_k$.
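A minimal sketch of estimating these parameters and evaluating the likelihood for one feature and one class (illustrative data, plain NumPy):

```python
import numpy as np

# Illustrative training values of feature x_i for samples belonging to class C_k.
x_train_class_k = np.array([4.9, 5.1, 5.3, 4.8, 5.0])

# Estimate the Gaussian parameters for this (class, feature) pair.
mu = x_train_class_k.mean()
var = x_train_class_k.var()  # the unbiased variant var(ddof=1) is also common

def gaussian_likelihood(x, mu, var):
    """P(x_i = x | C_k) under the fitted Gaussian."""
    return np.exp(-((x - mu) ** 2) / (2 * var)) / np.sqrt(2 * np.pi * var)

print(mu, var, gaussian_likelihood(5.2, mu, var))
```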
19. Why is the Naïve Bayes classifier considered a Generative Model?
A.Because it generates new training data to balance classes.
B.Because it models the joint probability $P(x, C_k)$ (via $P(x \mid C_k)\,P(C_k)$) and captures how the data is generated.
C.Because it directly learns the decision boundary without modeling densities.
D.Because it uses genetic algorithms for optimization.
Correct Answer: Because it models the joint probability $P(x, C_k)$ (via $P(x \mid C_k)\,P(C_k)$) and captures how the data is generated.
Explanation:Generative models learn the distribution of individual classes (the likelihood and prior) and can essentially 'generate' samples from the learned distribution. Discriminative models (like Logistic Regression) model the posterior $P(C_k \mid x)$ directly.
20. In the context of multi-class classification, if the decision regions are separated by linear boundaries, the classifier is known as:
A.A Linear Classifier
B.A Quadratic Classifier
C.A Non-parametric Classifier
D.A Decision Tree
Correct Answer: A Linear Classifier
Explanation:A linear classifier makes decisions based on the value of a linear combination of the characteristics. This results in decision boundaries that are hyperplanes (linear).
21. Which version of Naïve Bayes is best suited for binary feature vectors (e.g., word presence/absence in text classification)?
A.Gaussian Naïve Bayes
B.Multinomial Naïve Bayes
C.Bernoulli Naïve Bayes
D.Linear Naïve Bayes
Correct Answer: Bernoulli Naïve Bayes
Explanation:Bernoulli Naïve Bayes is designed for binary/boolean features. It explicitly models the absence of terms as well as their presence.
22. What is MAP estimation in the context of Bayesian classification?
A.Maximum Average Precision
B.Minimum Absolute Posteriori
C.Maximum A Posteriori
D.Mean Average Probability
Correct Answer: Maximum A Posteriori
Explanation:MAP stands for Maximum A Posteriori. It is the method of estimating the variable (class) that maximizes the posterior probability $P(C_k \mid x)$.
23. How does Maximum Likelihood (ML) estimation differ from MAP estimation?
A.ML assumes a uniform prior (or ignores the prior), while MAP accounts for the prior $P(C_k)$.
B.ML is for regression, MAP is for classification.
C.MAP assumes a uniform likelihood, ML calculates likelihood.
D.There is no difference; they are identical.
Correct Answer: ML assumes a uniform prior (or ignores the prior), while MAP accounts for the prior $P(C_k)$.
Explanation:MAP maximizes $P(x \mid C_k)\,P(C_k)$, whereas ML maximizes only $P(x \mid C_k)$. If the prior is uniform (all classes equally likely), MAP reduces to ML.
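Side by side:
$$\hat{C}_{\text{MAP}} = \arg\max_{k}\, P(x \mid C_k)\,P(C_k), \qquad \hat{C}_{\text{ML}} = \arg\max_{k}\, P(x \mid C_k).$$
With a uniform prior $P(C_k) = 1/K$, the constant factor does not change the argmax, so the two estimates coincide.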
24. What is the computational complexity of predicting a class for a single instance using Naïve Bayes with $n$ features and $K$ classes?
A.
B.
C.
D.
Correct Answer: $O(K \cdot n)$
Explanation:For each of the $K$ classes, we multiply $n$ probabilities (one for each feature). Thus, the complexity is linear with respect to both the number of classes and the number of features.
25. In Bayesian Decision Theory, $\lambda(\alpha_i \mid \omega_j)$ typically denotes:
A.The probability of class $\omega_j$.
B.The loss incurred by taking action $\alpha_i$ when the true state of nature is $\omega_j$.
C.The likelihood of feature $x$.
D.The learning rate.
Correct Answer: The loss incurred by taking action $\alpha_i$ when the true state of nature is $\omega_j$.
Explanation:This notation represents the Loss Function matrix, quantifying the cost of making a specific decision (action) given the actual ground truth.
26. Which of the following is true regarding the Decision Boundary of a Gaussian Naïve Bayes classifier if all classes share the same covariance matrix?
A.The boundary is quadratic.
B.The boundary is linear.
C.The boundary is circular.
D.There is no decision boundary.
Correct Answer: The boundary is linear.
Explanation:If the covariance matrices are identical for all classes, the quadratic terms in the discriminant functions cancel out, leaving a linear function of $x$. Thus, it behaves like a linear classifier (similar to Linear Discriminant Analysis).
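A sketch of why the quadratic term drops out: with a shared covariance $\Sigma$, the log-discriminant of class $C_k$ (up to class-independent constants) is
$$g_k(x) = -\tfrac{1}{2}(x - \mu_k)^\top \Sigma^{-1} (x - \mu_k) + \ln P(C_k),$$
and when two such discriminants are compared, the common term $-\tfrac{1}{2} x^\top \Sigma^{-1} x$ cancels, leaving an expression that is linear in $x$.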
27. Why do we often work with Log-Probabilities (sums of logs) instead of direct probabilities (products) in Naïve Bayes?
A.To make the math harder.
B.Because logs are always positive.
C.To prevent numerical underflow when multiplying many small probabilities.
D.Because log probabilities are required by the Bayes theorem definition.
Correct Answer: To prevent numerical underflow when multiplying many small probabilities.
Explanation:Probabilities are $\le 1$. Multiplying many of them results in extremely small numbers that can vanish (underflow) in floating-point arithmetic. Summing logs ($\log \prod_i p_i = \sum_i \log p_i$) avoids this and preserves the ranking order (since log is monotonic).
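A quick demonstration of the underflow issue (illustrative numbers):

```python
import math

probs = [1e-5] * 80  # eighty small per-feature likelihoods

product = 1.0
for p in probs:
    product *= p
print(product)  # 0.0 -- the true value (1e-400) underflows to zero in float64

log_score = sum(math.log(p) for p in probs)
print(log_score)  # about -921.03 -- easily representable, ranking preserved
```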
28. Which of the following text classification scenarios is Multinomial Naïve Bayes typically used for?
A.When features represent the presence/absence of words (binary).
B.When features represent word counts or term frequencies.
C.When features are continuous word embeddings.
D.When the text length is infinite.
Correct Answer: When features represent word counts or term frequencies.
Explanation:Multinomial NB models the distribution of counts (frequencies) of events (words) generated from a multinomial distribution.
29. A classifier that distinguishes between 'Spam' and 'Not Spam' is an example of:
A.Clustering
B.Regression
C.Binary Classification
D.Reinforcement Learning
Correct Answer: Binary Classification
Explanation:There are exactly two possible output classes, making it a binary classification problem.
30. In the formulation $f(\mathbf{x}) = \mathbf{w}^\top \mathbf{x} + b$, if and , which side of the boundary does the point fall on?
A.Positive side ($f(\mathbf{x}) > 0$)
B.Negative side ($f(\mathbf{x}) < 0$)
C.On the boundary ($f(\mathbf{x}) = 0$)
D.Undefined
Correct Answer: Positive side ($f(\mathbf{x}) > 0$)
Explanation:Substituting the values into $f(\mathbf{x}) = \mathbf{w}^\top \mathbf{x} + b$ gives a positive result. Since $f(\mathbf{x}) > 0$, the point falls on the positive side.
31. In Bayesian Decision Theory, the Evidence $P(x)$ acts as a:
A.Weighting factor for the likelihood.
B.Normalization constant to ensure probabilities sum to 1.
C.Prior belief about the feature distribution.
D.Loss function.
Correct Answer: Normalization constant to ensure probabilities sum to 1.
Explanation:The evidence $P(x)$ scales the numerator ($P(x \mid C_k)\,P(C_k)$) so that the posterior probabilities over all classes sum to 1. It does not affect the ranking of classes.
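Concretely, the evidence is the total probability of the observation:
$$P(x) = \sum_{k} P(x \mid C_k)\,P(C_k), \qquad \text{so that} \qquad \sum_{k} P(C_k \mid x) = \frac{\sum_{k} P(x \mid C_k)\,P(C_k)}{P(x)} = 1.$$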
32. Which of the following statements about Decision Regions is correct?
A.Decision regions can never be disjoint.
B.The union of all decision regions must cover the entire feature space.
C.Decision regions must always be convex.
D.Decision regions are only defined for training data.
Correct Answer: The union of all decision regions must cover the entire feature space.
Explanation:A classifier assigns every point in the feature space to a class (or a reject option), so the regions partition the space.
33. The One-vs-One strategy generally requires more space to store models than One-vs-All ($\frac{K(K-1)}{2}$ vs $K$ classifiers). Why might it still be preferred?
A.Each individual classifier is trained on a smaller subset of data (only two classes), potentially making training faster.
B.It is the only method that supports Neural Networks.
C.It does not require labels.
D.It guarantees 100% accuracy.
Correct Answer: Each individual classifier is trained on a smaller subset of data (only two classes), potentially making training faster.
Explanation:Although there are more classifiers, each one only looks at data from two classes. For algorithms that scale super-linearly with data size (like SVMs), training many small models is often faster than training fewer large models (OvA).
34. In a probabilistic classifier, if $P(C_1 \mid x) > P(C_2 \mid x)$, and the loss for misclassifying class 1 is much higher than misclassifying class 2, Bayesian Decision Theory might suggest:
A.Choosing class 1 regardless of cost.
B.Choosing class 2 if the expected risk is lower, even if the probability is lower.
C.Choosing the class with the highest probability always.
D.Refusing to classify.
Correct Answer: Choosing class 2 if the expected risk is lower, even if the probability is lower.
Explanation:Bayesian Decision Theory minimizes Risk, not just error rate. If the cost (loss) of a specific error is very high, the optimal decision might be to select a less probable class to avoid that high cost.
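A worked illustration with made-up numbers (not the values from the question): suppose $P(C_1 \mid x) = 0.7$, $P(C_2 \mid x) = 0.3$, the loss of wrongly deciding class 1 is $\lambda(\alpha_1 \mid C_2) = 10$, and the loss of wrongly deciding class 2 is $\lambda(\alpha_2 \mid C_1) = 1$ (correct decisions cost 0). Then
$$R(\alpha_1 \mid x) = 10 \times 0.3 = 3, \qquad R(\alpha_2 \mid x) = 1 \times 0.7 = 0.7,$$
so the risk-minimizing decision is class 2 even though class 1 has the higher posterior.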
35. What is the weight vector $\mathbf{w}$ in a linear classifier geometrically orthogonal to?
A.The x-axis.
B.The y-axis.
C.The decision boundary (hyperplane).
D.The data points.
Correct Answer: The decision boundary (hyperplane).
Explanation:The weight vector $\mathbf{w}$ is the normal vector to the separating hyperplane defined by $\mathbf{w}^\top \mathbf{x} + b = 0$.
36. In Multinomial Naïve Bayes with Laplace (Add-one) smoothing, which formula gives the probability of a word $w_i$ given class $C_k$?
Correct Answer: $P(w_i \mid C_k) = \frac{\operatorname{count}(w_i, C_k) + 1}{\sum_{w \in V} \operatorname{count}(w, C_k) + |V|}$
Explanation:This is the formula for calculating the probability of a word given a class in Multinomial Naïve Bayes with Laplace smoothing (Add-one smoothing).
37. Why does the Naïve Bayes independence assumption often work well in practice even when features are somewhat dependent?
A.Because dependencies cancel each other out.
B.Because classification relies on the correct sign/ranking of the posterior, not the exact probability value.
C.Because real-world data is always independent.
D.Because the algorithm corrects the dependencies during training.
Correct Answer: Because classification relies on the correct sign/ranking of the posterior, not the exact probability value.
Explanation:Even if the probability estimates are biased due to the independence assumption, the classifier will still classify correctly as long as the correct class remains the one with the highest (albeit inaccurate) probability score.
38. If a classifier produces a probability $P(C_k \mid x)$, it is known as a:
A.Hard Classifier
B.Soft (Probabilistic) Classifier
C.Deterministic Classifier
D.Regressive Classifier
Correct Answer: Soft (Probabilistic) Classifier
Explanation:Soft classifiers output the probabilities of class membership, whereas hard classifiers output just the label.
39. In the context of Bayes' Theorem, if the Prior $P(C_k)$ is uniform for all classes, the MAP estimate is equivalent to maximizing:
A.The Evidence
B.The Likelihood
C.The Loss Function
D.The Variance
Correct Answer: The Likelihood
Explanation:If $P(C_k)$ is constant, maximizing $P(x \mid C_k)\,P(C_k)$ is the same as maximizing $P(x \mid C_k)$.
40. What is the dimension of the decision boundary for a binary classification problem with 10 input features?
A.1
B.2
C.9
D.10
Correct Answer: 9
Explanation:The decision boundary is a hyperplane of dimension $d - 1$, where $d$ is the dimension of the feature space. Here $d = 10$, so the boundary is 9-dimensional.
41. Which of the following is NOT a property of a Linear Classifier?
A.Computationally efficient (fast inference).
B.Simple to interpret (weights indicate feature importance).
C.Can learn complex XOR relationships directly without feature engineering.
D.Less prone to overfitting compared to high-degree polynomial classifiers.
Correct Answer: Can learn complex XOR relationships directly without feature engineering.
Explanation:Linear classifiers cannot solve the XOR problem because the XOR classes are not linearly separable.
42. When applying Naïve Bayes, how is the Prior $P(C_k)$ usually estimated from training data?
A.Average value of the features for class $C_k$.
B.Fraction of training samples belonging to class $C_k$.
C.Correlation coefficient of class $C_k$.
D.It is always set to 0.5.
Correct Answer: Fraction of training samples belonging to class $C_k$.
Explanation:$P(C_k) = \frac{N_k}{N}$, where $N_k$ is the count of samples in class $C_k$ and $N$ is the total number of samples.
43. Which statement best describes the Bayes Error Rate?
A.The error rate of a Naïve Bayes classifier.
B.The lowest possible error rate for any classifier on a given distribution.
C.The error rate when is ignored.
D.The rate at which the algorithm converges.
Correct Answer: The lowest possible error rate for any classifier on a given distribution.
Explanation:The Bayes Error Rate is the irreducible error inherent in the data distribution (due to noise/overlap of classes). No classifier can outperform the theoretical Bayes Optimal Classifier.
44. In a 3-class problem using One-vs-Rest, if the outputs of the three classifiers for a point are , , , which class is predicted?
A.Class 1
B.Class 2
C.Class 3
D.None
Correct Answer: Class 3
Explanation:In One-vs-Rest, we choose the class corresponding to the classifier with the highest confidence score (signed distance). $1.5$ is the maximum.
45. If features $x_i$ and $x_j$ are duplicates ($x_i = x_j$), how does this affect Naïve Bayes?
A.It has no effect.
B.It improves accuracy by reinforcing the signal.
C.It violates the independence assumption and 'double counts' the importance of that feature.
D.It causes a division by zero.
Correct Answer: It violates the independence assumption and 'double counts' the importance of that feature.
Explanation:Naïve Bayes multiplies probabilities. If a feature is duplicated, its probability is multiplied twice, effectively squaring it and giving that feature undue weight (double counting), which skews the posterior.
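For instance (illustrative numbers): if the observed value has $P(x_i \mid C_1) = 0.9$ and $P(x_i \mid C_2) = 0.5$, a single copy of the feature contributes factors of $0.9$ vs $0.5$ to the two class scores, but a duplicated copy contributes $0.9^2 = 0.81$ vs $0.5^2 = 0.25$, exaggerating the evidence in favor of $C_1$ beyond what the data actually supports.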
46. What is a Reject Option in classification?
A.Removing outliers from the training set.
B.Refraining from making a prediction if the posterior probability is below a certain threshold.
C.Rejecting the null hypothesis.
D.Deleting features that are not useful.
Correct Answer: Refraining from making a prediction if the posterior probability is below a certain threshold.
Explanation:Under ambiguity (low confidence in all classes), a classifier may choose to 'reject' the input (e.g., send it for human review) rather than make a likely incorrect guess.
47. Geometrically, what does the likelihood $P(x \mid C_k)$ in a Gaussian Naïve Bayes represent?
A.The distance of $x$ from the decision boundary.
B.The density of the data point $x$ within the cluster of class $C_k$.
C.The probability of the class $C_k$.
D.The volume of the dataset.
Correct Answer: The density of the data point $x$ within the cluster of class $C_k$.
Explanation:For a Gaussian distribution, the likelihood indicates how close the point is to the mean of the class distribution, scaled by the variance.
48. Which of the following is an example of a Discriminative approach to classification?
A.Naïve Bayes
B.Logistic Regression
C.Hidden Markov Models
D.Gaussian Mixture Models
Correct Answer: Logistic Regression
Explanation:Logistic Regression models the posterior $P(C_k \mid x)$ directly (Discriminative), whereas the others listed are Generative models (modeling $P(x \mid C_k)$ and $P(C_k)$).
49. In the context of the One-vs-One strategy, if there is a tie in the voting (e.g., class A and class B both get the same number of votes), how is it typically resolved?
A.Random selection or based on highest aggregate confidence score.
B.The model crashes.
C.Both classes are returned.
D.The process is restarted.
Correct Answer: Random selection or based on highest aggregate confidence score.
Explanation:Ties are usually broken by either summing the raw decision function values (confidence) for the tied classes or simply picking one randomly/lexicographically.
50. What happens to the decision boundary in a Linear Classifier if we multiply all weights and the bias by a positive constant $c$?
A.The boundary shifts.
B.The boundary rotates.
C.The boundary remains unchanged.
D.The boundary becomes non-linear.
Correct Answer: The boundary remains unchanged.
Explanation:The equation $c\,(\mathbf{w}^\top \mathbf{x} + b) = 0$ is equivalent to $\mathbf{w}^\top \mathbf{x} + b = 0$ (since $c \neq 0$). The hyperplane geometry is identical, though the magnitude of the output scores changes.