1. Which of the following best describes the goal of Supervised Learning?
A. To find hidden structures in unlabeled data.
B. To learn a mapping from input variables (X) to an output variable (Y) using labeled training data.
C. To maximize a reward signal through interaction with an environment.
D. To reduce the dimensionality of the dataset.
Correct Answer: To learn a mapping from input variables (X) to an output variable (Y) using labeled training data.
Explanation: Supervised learning algorithms build a mathematical model from a set of data that contains both the inputs and the desired outputs (labels).
2. In the Perceptron algorithm, what is the update rule for the weight vector w, given a learning rate η, target label y, and predicted label ŷ?
A.
B.
C.
D.
Correct Answer: w ← w + η(y − ŷ)x
Explanation: The Perceptron update rule adjusts the weights in the direction of the error, scaled by the input vector x and the learning rate η.
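The update rule can be sketched in a few lines of Python (a minimal illustration with made-up weights; the function name `perceptron_update` and the default `eta` are our own choices, not from any library):

```python
import numpy as np

def perceptron_update(w, x, y, y_hat, eta=0.1):
    """One Perceptron step: w <- w + eta * (y - y_hat) * x."""
    return w + eta * (y - y_hat) * np.asarray(x)

w = np.array([0.5, -0.2])

# Correct prediction (y == y_hat): the error term is 0, so w is unchanged.
w_same = perceptron_update(w, [1.0, 2.0], y=1, y_hat=1)

# Wrong prediction: the weights move by eta * x toward the target.
w_new = perceptron_update(w, [1.0, 2.0], y=1, y_hat=0)
```

Note that when the prediction is correct, (y − ŷ) = 0 and the weights stay put; only misclassified points change the model.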
3. A single-layer Perceptron can only classify data correctly if the data is:
A. Non-linearly separable
B. Linearly separable
C. High dimensional
D. Normally distributed
Correct Answer: Linearly separable
Explanation: A single-layer perceptron forms a linear decision boundary. It cannot solve problems like XOR, which are not linearly separable.
4. Which activation function is primarily used in Logistic Regression to map predictions to probabilities between 0 and 1?
A. ReLU
B. Tanh
C. Sigmoid
D. Linear
Correct Answer: Sigmoid
Explanation: The Sigmoid function, defined as σ(z) = 1 / (1 + e^(−z)), maps any real-valued number into the range (0, 1).
5. What is the mathematical formulation of the Sigmoid function σ(z)?
A.
B.
C.
D.
Correct Answer: σ(z) = 1 / (1 + e^(−z))
Explanation: The standard logistic or sigmoid function is σ(z) = 1 / (1 + e^(−z)).
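The sigmoid is short enough to write directly (a minimal sketch using only the standard library):

```python
import math

def sigmoid(z):
    """Logistic function: maps any real z into the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

# sigmoid(0) is exactly 0.5; large positive z approaches 1, large negative z approaches 0.
```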
6. Which cost function is typically minimized in Logistic Regression?
A. Mean Squared Error (MSE)
B. Hinge Loss
C. Binary Cross-Entropy (Log Loss)
D. Gini Impurity
Correct Answer: Binary Cross-Entropy (Log Loss)
Explanation: Logistic regression uses Log Loss (Binary Cross-Entropy) because it is a convex function for sigmoid outputs, ensuring a global minimum during optimization.
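Binary cross-entropy can be sketched as follows (a minimal standard-library version; the `eps` clamp is our own guard against log(0), not part of the formula itself):

```python
import math

def binary_cross_entropy(y_true, y_prob, eps=1e-12):
    """Mean log loss: -(y*log(p) + (1-y)*log(1-p)), averaged over the batch."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1 - eps)  # clamp to avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)
```

Confident correct predictions give a small loss; confident wrong predictions are penalized heavily, which is why log loss pairs well with sigmoid outputs.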
7. In Naïve Bayes, what is the fundamental assumption made about the features?
A. Features are dependent on each other.
B. Features are mutually exclusive.
C. Features are conditionally independent given the class label.
D. Features must be categorical.
Correct Answer: Features are conditionally independent given the class label.
Explanation: The 'Naïve' aspect of Naïve Bayes is the assumption that the presence of a particular feature in a class is unrelated to the presence of any other feature.
8. What is the formula for Bayes' Theorem regarding the probability of class C given data X?
A.
B.
C.
D.
Correct Answer: P(C|X) = P(X|C) · P(C) / P(X)
Explanation: Bayes' theorem states that the posterior P(C|X) is the likelihood P(X|C) times the prior P(C), divided by the evidence P(X).
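The theorem is a one-liner in code (the numbers below are hypothetical, chosen only to illustrate the arithmetic):

```python
def posterior(likelihood, prior, evidence):
    """Bayes' theorem: P(C|X) = P(X|C) * P(C) / P(X)."""
    return likelihood * prior / evidence

# Hypothetical example: P(X|C) = 0.8, P(C) = 0.3, P(X) = 0.4
p_c_given_x = posterior(0.8, 0.3, 0.4)  # 0.24 / 0.4 = 0.6
```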
9. Why is Laplace Smoothing (Additive Smoothing) used in Naïve Bayes?
A. To handle non-linear data.
B. To prevent the probability from becoming zero for unseen features.
C. To normalize the dataset.
D. To reduce the dimensionality of the data.
Correct Answer: To prevent the probability from becoming zero for unseen features.
Explanation: If a categorical feature value was not present in the training set for a specific class, the likelihood becomes zero, wiping out the entire probability calculation. Laplace smoothing adds a small count to avoid this.
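A smoothed likelihood estimate looks like this (a minimal sketch; the parameter names are our own, with `alpha = 1` giving classic Laplace smoothing):

```python
def smoothed_likelihood(count_xy, count_y, n_values, alpha=1.0):
    """P(x|y) with additive smoothing:
    (count(x, y) + alpha) / (count(y) + alpha * n_values),
    where n_values is the number of distinct values the feature can take."""
    return (count_xy + alpha) / (count_y + alpha * n_values)

# A feature value never seen with this class (count 0) still gets
# a small nonzero probability instead of zeroing out the whole product.
p_unseen = smoothed_likelihood(0, 10, 3)  # 1 / 13
```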
10. In a Confusion Matrix, what does a False Positive (FP) represent?
A. The model correctly predicted the positive class.
B. The model incorrectly predicted the positive class (actual was negative).
C. The model incorrectly predicted the negative class (actual was positive).
D. The model correctly predicted the negative class.
Correct Answer: The model incorrectly predicted the positive class (actual was negative).
Explanation: A False Positive is a 'Type I error' where the model predicts the positive class, but the ground truth is negative.
11. How is Precision calculated?
A.
B.
C.
D.
Correct Answer: Precision = TP / (TP + FP)
Explanation: Precision measures the accuracy of positive predictions: True Positives divided by all predicted positives, TP / (TP + FP).
12. How is Recall (Sensitivity) calculated?
A.
B.
C.
D.
Correct Answer: Recall = TP / (TP + FN)
Explanation: Recall measures the ability of the model to find all the positive samples: True Positives divided by all actual positives, TP / (TP + FN).
13. The F1-Score is the harmonic mean of which two metrics?
A. Accuracy and Precision
B. Specificity and Sensitivity
C. Precision and Recall
D. TPR and FPR
Correct Answer: Precision and Recall
Explanation: The F1-Score balances Precision and Recall, calculated as F1 = 2 · (Precision · Recall) / (Precision + Recall).
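The three metrics from questions 11–13 can be computed directly from confusion-matrix counts (a minimal sketch; the counts in the comment are made up for illustration):

```python
def precision(tp, fp):
    """Fraction of predicted positives that are actually positive."""
    return tp / (tp + fp)

def recall(tp, fn):
    """Fraction of actual positives the model found."""
    return tp / (tp + fn)

def f1_score(tp, fp, fn):
    """Harmonic mean of precision and recall."""
    p, r = precision(tp, fp), recall(tp, fn)
    return 2 * p * r / (p + r)

# Example counts: TP=8, FP=2, FN=8 -> precision 0.8, recall 0.5
```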
14. Which of the following models is known as a Lazy Learner?
A. Logistic Regression
B. Support Vector Machine
C. K-Nearest Neighbors (KNN)
D. Decision Tree
Correct Answer: K-Nearest Neighbors (KNN)
Explanation: KNN is a lazy learner because it does not learn a discriminative function from the training data but memorizes the training dataset instead.
15. In K-Nearest Neighbors (KNN), what is the effect of choosing a very small value for K (e.g., K = 1)?
A. High Bias, Low Variance
B. The decision boundary becomes very smooth.
C. The model becomes computationally cheaper.
D. High Variance, Low Bias (Overfitting)
Correct Answer: High Variance, Low Bias (Overfitting)
Explanation: With a small K, the model captures local noise and outliers, leading to a complex decision boundary (Overfitting/High Variance).
16. Which distance metric is most commonly used in KNN for continuous variables?
A. Jaccard Distance
B. Hamming Distance
C. Euclidean Distance
D. Cosine Similarity
Correct Answer: Euclidean Distance
Explanation: Euclidean distance (the L2 norm) is the straight-line distance between two points and is the default for continuous data in KNN.
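A bare-bones KNN classifier with Euclidean distance fits in a few lines (a minimal sketch using only the standard library; the toy data points are our own):

```python
import math
from collections import Counter

def euclidean(a, b):
    """Straight-line (L2) distance between two points."""
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

def knn_predict(X_train, y_train, x, k=3):
    """Majority vote among the k training points nearest to x."""
    dists = sorted(zip((euclidean(xt, x) for xt in X_train), y_train))
    return Counter(label for _, label in dists[:k]).most_common(1)[0][0]

# Toy data: two tight clusters with labels 'a' and 'b'.
X = [(0, 0), (0, 1), (5, 5), (6, 5)]
y = ['a', 'a', 'b', 'b']
```

Note there is no training step at all, which is exactly the "lazy learner" behavior from question 14.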
17. Why is feature scaling (normalization/standardization) crucial for distance-based models like KNN and SVM?
A. It is required for the code to compile.
B. Distance calculations are dominated by features with larger magnitudes.
C. It converts all features to categorical data.
D. It increases the number of dimensions.
Correct Answer: Distance calculations are dominated by features with larger magnitudes.
Explanation: Without scaling, a feature ranging from 0-1000 will overpower a feature ranging from 0-1 in distance calculations, biasing the model.
18. In a Decision Tree, which metric is commonly used to measure impurity for classification tasks?
A. Mean Squared Error
B. Gini Impurity
C. R-Squared
D. Euclidean Distance
Correct Answer: Gini Impurity
Explanation: Gini Impurity (and Entropy) are the standard metrics used to evaluate how 'pure' a node is during the splitting process in classification trees.
19. What is the formula for Entropy for a binary classification problem with positive probability p and negative probability q?
A.
B.
C.
D.
Correct Answer: H = −p log₂ p − q log₂ q
Explanation: Entropy measures disorder or uncertainty. The formula is the summation of −pᵢ log₂ pᵢ over all classes.
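Binary entropy can be sketched like this (a minimal standard-library version; the explicit check for pure nodes avoids log(0)):

```python
import math

def entropy(p_pos):
    """Binary entropy H = -p*log2(p) - (1-p)*log2(1-p)."""
    if p_pos in (0.0, 1.0):
        return 0.0  # a pure node has no uncertainty
    p_neg = 1.0 - p_pos
    return -p_pos * math.log2(p_pos) - p_neg * math.log2(p_neg)
```

Entropy peaks at 1 bit for a 50/50 split and falls to 0 for a pure node, matching question 41 below.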
20. What is Information Gain in the context of Decision Trees?
A. The total number of nodes in the tree.
B. The decrease in entropy (or impurity) achieved by splitting a node.
C. The increase in accuracy on the test set.
D. The depth of the tree.
Correct Answer: The decrease in entropy (or impurity) achieved by splitting a node.
Explanation: Information Gain is the difference between the entropy of the parent node and the weighted average entropy of the child nodes. The algorithm chooses the split with the highest gain.
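The "parent entropy minus weighted child entropy" definition translates directly to code (a minimal sketch for a binary split; the `(n_samples, positive_fraction)` representation of a child node is our own convention):

```python
import math

def binary_entropy(p):
    """Binary entropy in bits."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def information_gain(parent_p, left, right):
    """Entropy drop from splitting a parent node into two children.
    left/right are (n_samples, positive_fraction) pairs."""
    (n_l, p_l), (n_r, p_r) = left, right
    n = n_l + n_r
    child = (n_l / n) * binary_entropy(p_l) + (n_r / n) * binary_entropy(p_r)
    return binary_entropy(parent_p) - child
```

A perfect split of a 50/50 parent into two pure children yields the maximum gain of 1 bit; a split that leaves both children at 50/50 yields zero gain.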
21. What technique is used to reduce overfitting in Decision Trees by removing sections of the tree that provide little power to classify instances?
A. Boosting
B. Bagging
C. Pruning
D. Scaling
Correct Answer: Pruning
Explanation: Pruning involves cutting back branches of the tree that use features with low importance or result in overfitting to noise.
22. What is the primary objective of a Support Vector Machine (SVM)?
A. To minimize the number of support vectors.
B. To find the hyperplane that maximizes the margin between classes.
C. To find the hyperplane that minimizes the distance to all points.
D. To calculate the conditional probability of classes.
Correct Answer: To find the hyperplane that maximizes the margin between classes.
Explanation: SVM aims to find the optimal decision boundary (hyperplane) that has the largest distance (margin) to the nearest training data points of any class.
23. In SVM, what are the Support Vectors?
A. The data points furthest from the decision boundary.
B. The data points that lie closest to the decision boundary.
C. The center points of each class cluster.
D. The incorrectly classified points only.
Correct Answer: The data points that lie closest to the decision boundary.
Explanation: Support vectors are the critical elements of the training set; they lie on the margin boundaries and define the position of the hyperplane.
24. What allows SVMs to classify non-linearly separable data by mapping inputs into a higher-dimensional space?
A. Regularization
B. The Kernel Trick
C. Gradient Descent
D. Backpropagation
Correct Answer: The Kernel Trick
Explanation: The Kernel Trick allows SVM to compute the dot product in a high-dimensional feature space without explicitly calculating the transformation, enabling non-linear classification.
25. In SVM, what is the role of the regularization parameter C?
A. It determines the number of kernels.
B. It controls the trade-off between maximizing the margin and minimizing training classification errors.
C. It sets the threshold for probability.
D. It initializes the weights to zero.
Correct Answer: It controls the trade-off between maximizing the margin and minimizing training classification errors.
Explanation: A large C penalizes misclassifications heavily (hard margin), while a small C allows more misclassifications to achieve a wider margin (soft margin).
26. The ROC Curve plots which two metrics against each other?
A. Precision vs Recall
B. True Positive Rate (TPR) vs False Positive Rate (FPR)
Correct Answer: True Positive Rate (TPR) vs False Positive Rate (FPR)
Explanation: The Receiver Operating Characteristic (ROC) curve is a graphical plot that illustrates the diagnostic ability of a binary classifier system as its discrimination threshold is varied, plotting TPR against FPR.
27. What does an AUC (Area Under the Curve) of 0.5 indicate?
A. Perfect classification.
B. The model performs no better than random guessing.
C. The model predicts all negatives.
D. High precision but low recall.
Correct Answer: The model performs no better than random guessing.
Explanation: The diagonal line on an ROC curve represents random guessing, which has an area of 0.5.
28. When is the Precision-Recall (PR) Curve preferred over the ROC Curve?
A. When the classes are perfectly balanced.
B. When the dataset is small.
C. When there is a severe class imbalance (e.g., rare disease detection).
D. When using a Decision Tree.
Correct Answer: When there is a severe class imbalance (e.g., rare disease detection).
Explanation: ROC curves can present an overly optimistic view of performance on imbalanced datasets. The PR curve focuses on the minority class (Positives) and is more informative in these cases.
29. What is the formula for the False Positive Rate (FPR)?
A.
B.
C.
D.
Correct Answer: FPR = FP / (FP + TN)
Explanation: FPR is the ratio of negative instances that are incorrectly classified as positive: FP / (FP + TN).
30. Which of the following is a parametric model?
A. K-Nearest Neighbors
B. Decision Tree
C. Logistic Regression
D. Random Forest
Correct Answer: Logistic Regression
Explanation: Logistic Regression summarizes data with a fixed set of parameters (weights). KNN and Decision Trees are non-parametric as the model structure grows with the data.
31. In the context of the Confusion Matrix, what is Specificity?
A. True Positive Rate
B. True Negative Rate (TN / (TN + FP))
C. Precision
D. Accuracy
Correct Answer: True Negative Rate (TN / (TN + FP))
Explanation: Specificity measures the proportion of actual negatives that are correctly identified.
32. Which model produces decision boundaries that are always orthogonal to the feature axes (axis-aligned)?
A. Logistic Regression
B. Linear SVM
C. Decision Tree
D. Perceptron
Correct Answer: Decision Tree
Explanation: Decision trees split data based on a threshold of a single feature at a time, resulting in rectangular decision regions aligned with the axes.
33. If a classifier has high Precision but low Recall, what does this imply?
A. It captures most positive cases but has many false alarms.
B. It rarely predicts positive, but when it does, it is usually correct.
C. It performs randomly.
D. It predicts positive for almost everything.
Correct Answer: It rarely predicts positive, but when it does, it is usually correct.
Explanation: High precision means few False Positives. Low recall means many False Negatives. The model is conservative/strict in labeling positives.
34. For a Gaussian Naïve Bayes classifier, the likelihood of a continuous feature is calculated using:
A. The binomial distribution formula.
B. The probability density function (PDF) of a normal distribution.
C. Simple counting of occurrences.
D. The sigmoid function.
Correct Answer: The probability density function (PDF) of a normal distribution.
Explanation: Gaussian NB assumes continuous features follow a normal (Gaussian) distribution and uses the Gaussian PDF to estimate likelihoods.
35. What is the Hinge Loss function used for?
A. Linear Regression
B. Logistic Regression
C. Support Vector Machines (SVM)
D. Decision Trees
Correct Answer: Support Vector Machines (SVM)
Explanation: Hinge Loss, defined as L = max(0, 1 − y·f(x)) with labels y ∈ {−1, +1}, is the standard loss function for SVMs.
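Hinge loss is a one-liner (a minimal sketch; `score` stands for the raw decision value f(x)):

```python
def hinge_loss(y, score):
    """Hinge loss max(0, 1 - y * f(x)), with labels y in {-1, +1}."""
    return max(0.0, 1.0 - y * score)

# Correct predictions beyond the margin (y * f(x) >= 1) incur zero loss;
# points inside the margin or on the wrong side are penalized linearly.
```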
36. Which point on the ROC space represents a perfect classifier?
A. (0, 0)
B. (1, 1)
C. (0, 1) [Top-Left corner]
D. (1, 0) [Bottom-Right corner]
Correct Answer: (0, 1) [Top-Left corner]
Explanation: A perfect classifier has a True Positive Rate of 1 and a False Positive Rate of 0.
37. In Logistic Regression, the 'Log-Odds' or 'Logit' is linear with respect to:
A. The input features x.
B. The predicted probability p.
C. The error term.
D. The number of iterations.
Correct Answer: The input features x.
Explanation: log(p / (1 − p)) = wᵀx + b. The log-odds is a linear combination of the inputs.
38. When using KNN, increasing the value of K generally leads to:
A. More complex decision boundaries.
B. Smoother decision boundaries.
C. Zero training error.
D. Higher variance.
Correct Answer: Smoother decision boundaries.
Explanation: A larger K averages over more neighbors, smoothing out the decision boundary and reducing variance (but potentially increasing bias).
39. Which algorithm constructs a separating hyperplane using a One-vs-Rest (OvR) or One-vs-One (OvO) strategy for multi-class classification?
A. Decision Trees
B. Naïve Bayes
C. SVM (and other binary linear classifiers)
D. KNN
Correct Answer: SVM (and other binary linear classifiers)
Explanation: SVM is inherently a binary classifier. To handle multiple classes, strategies like OvR (one classifier per class) or OvO (one classifier per pair of classes) are used.
40. Which of the following is considered a Distance-Based Model?
A. Decision Tree
B. Naïve Bayes
C. K-Nearest Neighbors
D. Random Forest
Correct Answer: K-Nearest Neighbors
Explanation: KNN relies entirely on calculating distances between data points to make predictions.
41. In a decision tree, if a node contains only samples from a single class, its Entropy is:
A. 1
B. 0
C. 0.5
D. Infinite
Correct Answer: 0
Explanation: If a node is pure (all samples belong to one class), there is no uncertainty, so Entropy is 0.
42. The Bias term (b or w₀) in a linear model allows the hyperplane to:
A. Pass through the origin only.
B. Shift away from the origin.
C. Become non-linear.
D. Rotate but not shift.
Correct Answer: Shift away from the origin.
Explanation: Without a bias term, a linear model is forced to pass through the origin. The bias allows the decision boundary to be offset.
43. Which metric corresponds to the Area Under the Precision-Recall Curve (AUC-PR)?
A. Accuracy
B. Average Precision (AP)
C. F1-Score
D. Balanced Accuracy
Correct Answer: Average Precision (AP)
Explanation: The area under the PR curve is often summarized as Average Precision, which provides a single score for the quality of the model across all thresholds.
44. What is a disadvantage of the Perceptron algorithm compared to Logistic Regression?
A. It cannot handle continuous data.
B. It does not provide probabilistic outputs.
C. It is computationally more expensive.
D. It always overfits.
Correct Answer: It does not provide probabilistic outputs.
Explanation: Perceptrons output hard class labels (0 or 1) based on a step function, whereas Logistic Regression outputs probabilities.
45. In Naïve Bayes, if , , and , what is ?
A. 0.5
B. 0.8
C. 1.0
D. 0.2
Correct Answer: 1.0
Explanation: Applying Bayes' Theorem (likelihood times prior, divided by evidence) to the given values yields a posterior of 1.0.
46. Which of the following is a Generative Model?
A. Logistic Regression
B. Support Vector Machine
C. Naïve Bayes
D. Perceptron
Correct Answer: Naïve Bayes
Explanation: Naïve Bayes models the joint probability P(X, Y) (how the data is generated), while the others are Discriminative models modeling P(Y|X).
47. What is the computational complexity of the Training phase for KNN?
A. O(1), or effectively zero.
B.
C.
D.
Correct Answer: O(1), or effectively zero.
Explanation: KNN is a lazy learner; training simply involves storing the dataset. The computational cost occurs during the prediction (inference) phase.
48. Which kernel function is used to create a Radial Basis Function (RBF) SVM?
A.
B.
C.
D.
Correct Answer: K(x, x′) = exp(−γ ‖x − x′‖²)
Explanation: The RBF kernel (or Gaussian kernel) uses the exponential of the negative squared Euclidean distance: K(x, x′) = exp(−γ ‖x − x′‖²).
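The RBF kernel can be sketched as follows (a minimal standard-library version; `gamma` defaults to 1.0 here purely for illustration):

```python
import math

def rbf_kernel(x, z, gamma=1.0):
    """RBF (Gaussian) kernel: exp(-gamma * ||x - z||^2)."""
    sq_dist = sum((xi - zi) ** 2 for xi, zi in zip(x, z))
    return math.exp(-gamma * sq_dist)

# Identical points have similarity 1; similarity decays toward 0 with distance.
```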
49. If you have a dataset with 1000 negative samples and 10 positive samples, which evaluation metric is misleading?
A. Precision
B. Recall
C. F1-Score
D. Accuracy
Correct Answer: Accuracy
Explanation: A model that simply predicts 'Negative' for everything will have ~99% accuracy but is useless. Accuracy is misleading on imbalanced datasets.
50. What is the relationship between Decision Tree depth and Overfitting?
A. Deeper trees are less likely to overfit.
B. Deeper trees are more likely to overfit.
C. Depth has no impact on overfitting.
D. Shallow trees have high variance.
Correct Answer: Deeper trees are more likely to overfit.
Explanation: A deeper tree can create very complex decision boundaries that memorize the training data (including noise), leading to overfitting.