1. In the context of supervised learning, what distinguishes a classification problem from a regression problem?
A.The training data is unlabeled.
B.The target variable is categorical or discrete.
C.The input variables are continuous.
D.The target variable is continuous.
Correct Answer: The target variable is categorical or discrete.
Explanation:
Classification involves predicting a discrete class label (e.g., Spam/Not Spam), whereas regression involves predicting a continuous quantity.
2. Which scikit-learn method is primarily used to train a classifier on a dataset?
A.model.transform(X, y)
B.model.fit(X, y)
C.model.predict(X, y)
D.model.score(X, y)
Correct Answer: model.fit(X, y)
Explanation:
In scikit-learn, the fit method learns the model parameters from the training data and labels.
3. What is the standard shape of the input matrix expected by scikit-learn classifiers?
A.(n_features, n_classes)
B.(n_samples, n_samples)
C.(n_samples, n_features)
D.(n_features, n_samples)
Correct Answer: (n_samples, n_features)
Explanation:
Scikit-learn expects data in a 2D array where rows represent samples and columns represent features.
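A minimal sketch tying the last two answers together, using LogisticRegression as a stand-in for any scikit-learn classifier (the data here is invented for illustration):

from sklearn.linear_model import LogisticRegression
import numpy as np

# X has shape (n_samples, n_features): 4 samples, 2 features each
X = np.array([[1.0, 2.0], [2.0, 1.0], [8.0, 9.0], [9.0, 8.0]])
y = np.array([0, 0, 1, 1])          # one label per sample

model = LogisticRegression()
model.fit(X, y)                     # training happens here
print(model.predict([[1.5, 1.5]]))  # predict takes X only, no y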
4. Which of the following metrics is defined as the ratio of correctly predicted observations to the total observations?
A.Accuracy
B.Precision
C.F1-Score
D.Recall
Correct Answer: Accuracy
Explanation:
Accuracy is calculated as (TP + TN) / (TP + TN + FP + FN).
5. In a Confusion Matrix, what does a False Positive (FP) represent?
A.The model predicted Negative, and the actual class was Positive.
B.The model predicted Positive, and the actual class was Positive.
C.The model predicted Negative, and the actual class was Negative.
D.The model predicted Positive, but the actual class was Negative.
Correct Answer: The model predicted Positive, but the actual class was Negative.
Explanation:
A False Positive is a 'Type I error' where the model incorrectly predicts the positive class for a negative instance.
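A short sketch of how scikit-learn lays out the binary confusion matrix; the labels and predictions below are made up for illustration:

from sklearn.metrics import confusion_matrix

y_true = [0, 0, 1, 1, 1, 0]
y_pred = [0, 1, 1, 0, 1, 0]

# scikit-learn orders the matrix as [[TN, FP], [FN, TP]] for labels [0, 1]
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(tn, fp, fn, tp)  # 2 1 1 2 -> the FP is the one actual 0 predicted as 1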
6. Which metric is best suited for a classification problem where False Negatives are much more costly than False Positives (e.g., detecting a deadly disease)?
A.Precision
B.Accuracy
C.Specificity
D.Recall
Correct Answer: Recall
Explanation:
Recall (Sensitivity) measures the proportion of actual positives identified. High recall minimizes false negatives.
7. Calculate the Precision given: TP = 50, FP = 10, FN = 5.
A.0.91
B.0.83
C.0.20
D.0.50
Correct Answer: 0.83
Explanation:
Precision = TP / (TP + FP) = 50 / (50 + 10) ≈ 0.83.
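The same arithmetic in plain Python, using the counts assumed in question 7 (note that 0.91, the distractor, is what you get if you compute recall instead):

TP, FP, FN = 50, 10, 5

precision = TP / (TP + FP)   # 50 / 60 ≈ 0.83 (the correct answer)
recall    = TP / (TP + FN)   # 50 / 55 ≈ 0.91 (the distractor)
print(round(precision, 2), round(recall, 2))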
8. The F1-Score is the harmonic mean of which two metrics?
A.Precision and Recall
B.TPR and FPR
C.Specificity and Sensitivity
D.Accuracy and Recall
Correct Answer: Precision and Recall
Explanation:
The F1-Score balances Precision and Recall: F1 = 2 · (Precision · Recall) / (Precision + Recall).
9. In an ROC curve, the x-axis and y-axis represent which metrics respectively?
A.True Negative Rate vs True Positive Rate
B.False Positive Rate vs True Positive Rate
C.Precision vs Recall
D.Recall vs Accuracy
Correct Answer: False Positive Rate vs True Positive Rate
Explanation:
The Receiver Operating Characteristic (ROC) curve plots the True Positive Rate (Sensitivity) against the False Positive Rate (FPR = FP / (FP + TN)).
10. What does an AUC (Area Under Curve) score of 0.5 imply about a classifier?
A.It is a perfect classifier.
B.It predicts the negative class always.
C.It has zero errors.
D.It performs no better than random guessing.
Correct Answer: It performs no better than random guessing.
Explanation:
An AUC of 0.5 represents a model with no discrimination capacity, equivalent to flipping a coin.
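A sketch of both ideas with scikit-learn's metrics; the scores below are made-up probabilities:

from sklearn.metrics import roc_curve, roc_auc_score

y_true  = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]   # predicted probabilities for class 1

fpr, tpr, thresholds = roc_curve(y_true, y_score)  # x-axis: FPR, y-axis: TPR
print(roc_auc_score(y_true, y_score))              # 0.75 here; 0.5 would be chance level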
11. When using classification_report in scikit-learn, what does the macro avg represent?
A.The unweighted mean of the metric for each label.
B.The weighted average based on support size.
C.The standard deviation of the metric.
D.The accuracy of the model.
Correct Answer: The unweighted mean of the metric for each label.
Explanation:
Macro average calculates metrics for each class independently and then takes the average, treating all classes equally regardless of imbalance.
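A quick way to see macro avg (and the support-weighted weighted avg) side by side; toy labels, for illustration only:

from sklearn.metrics import classification_report

y_true = [0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 1, 1, 0]

# 'macro avg' averages each class's metric equally;
# 'weighted avg' weights each class's metric by its support
print(classification_report(y_true, y_pred))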
12. Which issue makes Accuracy a misleading metric?
A.Imbalanced datasets.
B.High computational cost.
C.Linearly separable data.
D.It cannot be calculated for multiclass problems.
Correct Answer: Imbalanced datasets.
Explanation:
In highly imbalanced datasets (e.g., 99% Class A, 1% Class B), a model predicting only Class A achieves 99% accuracy but is useless.
13. What is the activation function used in the standard Perceptron algorithm for binary classification?
A.Tanh function
B.Sigmoid function
C.Heaviside step function
D.ReLU function
Correct Answer: Heaviside step function
Explanation:
The standard Perceptron uses a step function that outputs 1 if w · x + b ≥ 0 and 0 (or −1) otherwise.
14. The Perceptron algorithm is guaranteed to converge only if:
A.The data is normally distributed.
B.The weights are initialized to zero.
C.The learning rate is greater than 1.
D.The data is linearly separable.
Correct Answer: The data is linearly separable.
Explanation:
The Perceptron Convergence Theorem states that if the data is linearly separable, the algorithm will find a separating hyperplane in a finite number of steps.
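scikit-learn ships a Perceptron implementation; a minimal sketch on a linearly separable toy set (the data is invented):

from sklearn.linear_model import Perceptron
import numpy as np

# Two well-separated blobs: linearly separable, so convergence is guaranteed
X = np.array([[0, 0], [0, 1], [5, 5], [5, 6]])
y = np.array([0, 0, 1, 1])

clf = Perceptron()
clf.fit(X, y)
print(clf.predict([[0.5, 0.5], [5.5, 5.5]]))  # -> [0 1]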
15. Which function maps the output of a linear equation to a probability value in (0, 1) in Logistic Regression?
A.Sigmoid (Logistic)
B.Step
C.Softmax
D.Logarithm
Correct Answer: Sigmoid (Logistic)
Explanation:
The Sigmoid function, σ(z) = 1 / (1 + e^(−z)), maps real-valued numbers to the range (0, 1).
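A tiny sketch of the squashing behaviour, using NumPy:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))  # output is always strictly between 0 and 1

print(sigmoid(np.array([-10.0, 0.0, 10.0])))  # ~[0.0000454, 0.5, 0.9999546]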
16. The decision boundary generated by a standard Logistic Regression model is:
A.Linear
B.Irregular
C.Circular
D.Polynomial
Correct Answer: Linear
Explanation:
Logistic regression is a linear classifier because the decision boundary is determined by a linear combination of inputs (w · x + b = 0).
17. In scikit-learn's LogisticRegression, what is the purpose of the parameter C?
A.It controls the learning rate.
B.It determines the kernel type.
C.It is the inverse of regularization strength.
D.It sets the number of iterations.
Correct Answer: It is the inverse of regularization strength.
Explanation:
Smaller values of C specify stronger regularization (penalizing large weights), while larger values imply weaker regularization.
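A sketch of how C is passed; since it is the inverse of regularization strength, the small-C model is the more heavily regularized one (toy data for illustration):

from sklearn.linear_model import LogisticRegression
import numpy as np

X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([0, 0, 1, 1])

strong_reg = LogisticRegression(C=0.01).fit(X, y)   # strong regularization: coefficients pushed toward zero
weak_reg   = LogisticRegression(C=100.0).fit(X, y)  # weak regularization: larger coefficients allowed
print(strong_reg.coef_, weak_reg.coef_)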
18. Which loss function is minimized in Logistic Regression?
A.Hinge Loss
B.Gini Impurity
C.Mean Squared Error
D.Log Loss (Cross-Entropy)
Correct Answer: Log Loss (Cross-Entropy)
Explanation:
Logistic regression minimizes the negative log-likelihood, also known as Log Loss or Binary Cross-Entropy.
19. How does k-Nearest Neighbors (k-NN) classify a new data point?
A.By finding the best splitting feature.
B.By projecting the point onto a hyperplane.
C.By calculating the probability using Bayes' theorem.
D.By taking a majority vote of the closest training examples.
Correct Answer: By taking a majority vote of the closest training examples.
Explanation:
k-NN identifies the k training samples closest to the query point and assigns the most frequent class among them.
20. Why is k-NN often referred to as a lazy learner?
A.It uses a simple distance metric.
B.It only generalizes the data during the prediction phase.
C.It ignores outliers.
D.It trains very slowly.
Correct Answer: It only generalizes the data during the prediction phase.
Explanation:
Lazy learners do not build a model during the training phase; they store the training data and perform computation only when a prediction is requested.
21. In k-NN, what is the effect of choosing a very small value for k (e.g., k = 1)?
A.High Bias, Low Variance (Underfitting)
B.Low Bias, High Variance (Overfitting)
C.The model becomes a linear classifier.
D.The decision boundary becomes smooth.
Correct Answer: Low Bias, High Variance (Overfitting)
Explanation:
With k = 1, the model is very sensitive to noise in the training data, leading to complex decision boundaries and potential overfitting (High Variance).
22. Which preprocessing step is critical for k-NN performance?
A.Removing correlations
B.Feature Scaling
C.One-hot encoding target labels
D.Increasing the number of features
Correct Answer: Feature Scaling
Explanation:
Since k-NN relies on distance calculations (like Euclidean), features with larger scales can dominate the distance metric if not normalized or standardized.
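One common pattern is to chain the scaler and the classifier so the scaling fitted on the training data is reapplied consistently; a sketch with made-up data:

from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
import numpy as np

# The second feature has a much larger scale and would dominate raw Euclidean distances
X = np.array([[1.0, 1000.0], [2.0, 2000.0], [3.0, 9000.0], [4.0, 8000.0]])
y = np.array([0, 0, 1, 1])

knn = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=3))
knn.fit(X, y)
print(knn.predict([[2.5, 5000.0]]))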
23. Which distance metric is calculated as d(x, y) = Σ_i |x_i − y_i|?
A.Minkowski Distance
B.Manhattan Distance
C.Euclidean Distance
D.Cosine Similarity
Correct Answer: Manhattan Distance
Explanation:
Manhattan distance (L1 norm) is the sum of the absolute differences between the coordinates.
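The L1 computation in two lines, plus the k-NN hook for it (in scikit-learn, Minkowski distance with p=1 is Manhattan distance):

import numpy as np
from sklearn.neighbors import KNeighborsClassifier

a, b = np.array([1.0, 2.0]), np.array([4.0, 6.0])
print(np.abs(a - b).sum())        # Manhattan distance: |1-4| + |2-6| = 7

knn = KNeighborsClassifier(p=1)   # p=1 selects Manhattan; p=2 (the default) is Euclidean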
24. In a Decision Tree, what does a leaf node represent?
A.The root of the tree.
B.A class label or probability.
C.A feature to split on.
D.A decision rule.
Correct Answer: A class label or probability.
Explanation:
Leaf nodes are the terminal nodes of the tree where no further splitting occurs, providing the final prediction.
25. Which metric does the CART algorithm (used by scikit-learn for Decision Trees) use by default to measure impurity?
A.Gini Impurity
B.Mean Squared Error
C.Entropy
D.Log Loss
Correct Answer: Gini Impurity
Explanation:
Scikit-learn's DecisionTreeClassifier uses criterion='gini' by default. It measures the probability of misclassifying a randomly chosen element.
26. Calculate the Gini Impurity of a node containing 3 positive samples and 3 negative samples.
A.0.25
B.0.0
C.1.0
D.0.5
Correct Answer: 0.5
Explanation:
Gini = 1 − (0.5² + 0.5²) = 1 − 0.5 = 0.5.
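The same arithmetic in Python, generalized to any class counts:

def gini(counts):
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)

print(gini([3, 3]))  # 0.5: maximum impurity for two classes
print(gini([6, 0]))  # 0.0: a pure node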
27. Which hyperparameter in DecisionTreeClassifier can be used to control overfitting?
A.learning_rate
B.max_depth
C.C
D.kernel
Correct Answer: max_depth
Explanation:
Limiting max_depth prevents the tree from growing too complex and memorizing the training noise.
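A sketch showing where the hyperparameter goes (criterion='gini' is already the default, shown here only for emphasis); the data is invented:

from sklearn.tree import DecisionTreeClassifier
import numpy as np

X = np.array([[0, 0], [1, 1], [2, 2], [3, 3]])
y = np.array([0, 0, 1, 1])

# Capping the depth limits model complexity and guards against memorizing noise
tree = DecisionTreeClassifier(criterion="gini", max_depth=3).fit(X, y)
print(tree.get_depth())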
28. What is the concept of Information Gain in Decision Trees?
A.The time taken to train the tree.
B.The increase in accuracy after a split.
C.The reduction in entropy (or impurity) achieved by a split.
D.The total number of nodes in the tree.
Correct Answer: The reduction in entropy (or impurity) achieved by a split.
Explanation:
Information Gain measures how much information a feature provides about the class, calculated as Entropy(parent) - Weighted Average Entropy(children).
29. Decision Trees split the feature space into regions using boundaries that are:
A.Circular
B.Curved
C.Diagonal
D.Orthogonal to the feature axes
Correct Answer: Orthogonal to the feature axes
Explanation:
Standard decision trees make splits based on single features (e.g., x_j ≤ t), resulting in rectangular decision regions aligned with the axes.
30. The primary objective of a Support Vector Machine (SVM) is to find a hyperplane that:
A.Separates data with zero error regardless of margin.
B.Minimizes the number of support vectors.
C.Maximizes the margin between classes.
D.Passes through the mean of the data.
Correct Answer: Maximizes the margin between classes.
Explanation:
SVM seeks the 'maximum margin hyperplane' to improve the generalization ability of the classifier.
31. What are Support Vectors in SVM?
A.The misclassified data points.
B.The centroids of the classes.
C.The data points furthest from the decision boundary.
D.The data points closest to the decision boundary.
Correct Answer: The data points closest to the decision boundary.
Explanation:
Support vectors are the critical elements of the training set that lie on the margin boundaries; they essentially define the hyperplane.
32. Which technique allows SVM to perform non-linear classification?
A.The Kernel Trick
B.Gradient Descent
C.Bagging
D.Pruning
Correct Answer: The Kernel Trick
Explanation:
The Kernel Trick maps input data into a higher-dimensional space where a linear separator can be found, without explicitly computing the coordinates.
33. In SVC (Support Vector Classifier), what does a high value of Gamma (γ) imply for an RBF kernel?
A.The margin becomes wider.
B.The model fits the training data very closely (potential overfitting).
C.Each training example has a wide-reaching influence.
D.The decision boundary will be nearly linear.
Correct Answer: The model fits the training data very closely (potential overfitting).
Explanation:
A high gamma gives each training example only a short-range influence, so the boundary bends around individual points, leading to complex, tight boundaries that capture noise.
34. Which scikit-learn class is used for Support Vector Classification?
A.sklearn.tree.DecisionTreeClassifier
B.sklearn.svm.SVR
C.sklearn.svm.SVC
D.sklearn.linear_model.SGDClassifier
Correct Answer: sklearn.svm.SVC
Explanation:
SVC stands for Support Vector Classification. SVR is for Regression.
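A sketch of questions 32-34 together: SVC with the RBF kernel, where gamma controls how tightly the boundary hugs the training points (data invented):

from sklearn.svm import SVC
import numpy as np

X = np.array([[0, 0], [1, 1], [4, 4], [5, 5]])
y = np.array([0, 0, 1, 1])

smooth = SVC(kernel="rbf", gamma=0.1).fit(X, y)    # low gamma: wide influence, smoother boundary
wiggly = SVC(kernel="rbf", gamma=100.0).fit(X, y)  # high gamma: tight, overfit-prone boundary
print(smooth.support_vectors_.shape, wiggly.support_vectors_.shape)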
35. The Naïve Bayes classifier is based on which statistical theorem?
A.Central Limit Theorem
B.Pythagorean Theorem
C.Gauss-Markov Theorem
D.Bayes' Theorem
Correct Answer: Bayes' Theorem
Explanation:
Naïve Bayes applies Bayes' Theorem: P(y | x) = P(x | y) · P(y) / P(x).
36. What is the "Naïve" assumption in Naïve Bayes?
A.All features are mutually independent given the class.
B.The classes are balanced.
C.All features are equally important.
D.The data follows a normal distribution.
Correct Answer: All features are mutually independent given the class.
Explanation:
It assumes that the presence of a particular feature in a class is unrelated to the presence of any other feature, simplifying the computation.
37. Which variant of Naïve Bayes is best suited for continuous data assuming a bell-curve distribution?
A.MultinomialNB
B.GaussianNB
C.ComplementNB
D.BernoulliNB
Correct Answer: GaussianNB
Explanation:
Gaussian Naïve Bayes assumes that the likelihood of the features is Gaussian (normally distributed).
38. In Text Classification with word counts, which Naïve Bayes variant is typically used?
A.GaussianNB
B.LinearNB
C.LogisticNB
D.MultinomialNB
Correct Answer: MultinomialNB
Explanation:
Multinomial Naïve Bayes is suitable for features that represent counts or discrete frequencies (like word counts in text).
39. What is Laplace Smoothing used for in Naïve Bayes?
A.To normalize the dataset.
B.To prevent zero probabilities for unseen features.
C.To reduce the number of features.
D.To handle continuous variables.
Correct Answer: To prevent zero probabilities for unseen features.
Explanation:
If a feature value in the test set was not present in the training set for a class, the probability becomes 0. Smoothing adds a small count (usually 1) to avoid this.
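A sketch covering questions 37-39: GaussianNB for continuous features, MultinomialNB for counts, with alpha as the smoothing parameter (alpha=1.0, the default, is classic Laplace smoothing); toy data throughout:

from sklearn.naive_bayes import GaussianNB, MultinomialNB
import numpy as np

# Continuous measurements -> GaussianNB
X_cont = np.array([[1.2], [0.9], [3.8], [4.1]])
gnb = GaussianNB().fit(X_cont, [0, 0, 1, 1])

# Word counts -> MultinomialNB; alpha=1.0 applies Laplace smoothing
X_counts = np.array([[2, 0, 1], [3, 1, 0], [0, 4, 2], [1, 3, 3]])
mnb = MultinomialNB(alpha=1.0).fit(X_counts, [0, 0, 1, 1])
print(gnb.predict([[4.0]]), mnb.predict([[0, 2, 2]]))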
40. Which of the following classifiers is a Generative Model?
A.Logistic Regression
B.Support Vector Machine
C.Naïve Bayes
D.Decision Tree
Correct Answer: Naïve Bayes
Explanation:
Naïve Bayes models the joint probability P(x, y) (how the data is generated), whereas the others are discriminative models that model P(y | x) directly.
41. To handle a multi-class classification problem with a binary classifier like Logistic Regression, which strategy is commonly used?
A.Gradient Boosting
B.Kernel Trick
C.One-vs-Rest (OvR)
D.Pruning
Correct Answer: One-vs-Rest (OvR)
Explanation:
OvR trains one classifier per class (that class vs. all other classes) and selects the class with the highest confidence score.
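A sketch of the OvR wrapper around a binary learner (scikit-learn's LogisticRegression can also handle multiclass on its own; the explicit wrapper just makes the strategy visible); toy data:

from sklearn.multiclass import OneVsRestClassifier
from sklearn.linear_model import LogisticRegression
import numpy as np

X = np.array([[0.0], [1.0], [5.0], [6.0], [10.0], [11.0]])
y = np.array([0, 0, 1, 1, 2, 2])

# Trains one binary classifier per class: 0-vs-rest, 1-vs-rest, 2-vs-rest
ovr = OneVsRestClassifier(LogisticRegression()).fit(X, y)
print(len(ovr.estimators_), ovr.predict([[5.5]]))  # 3 underlying classifiers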
42. Which metric is calculated using the formula 2TP / (2TP + FP + FN)?
A.F1-Score
B.Matthews Correlation Coefficient
C.Specificity
D.Accuracy
Correct Answer: F1-Score
Explanation:
This is an algebraic rearrangement of the F1-Score formula (Harmonic mean of Precision and Recall).
43. If a Decision Tree is fully grown until all leaves are pure, it is likely to have:
A.High Variance (Overfitting)
B.High Bias
C.Low Variance
D.Low Accuracy on training data
Correct Answer: High Variance (Overfitting)
Explanation:
A fully grown tree captures all the noise and specific patterns of the training data, leading to poor generalization (Overfitting/High Variance).
44. In the context of the Confusion Matrix, Specificity is also known as:
A.False Positive Rate
B.True Positive Rate
C.True Negative Rate
D.Precision
Correct Answer: True Negative Rate
Explanation:
Specificity measures the proportion of actual negatives that are correctly identified: Specificity = TN / (TN + FP).
45. Which scikit-learn utility is best used to split data into training and testing sets?
A.GridSearchCV
B.cross_val_score
C.StandardScaler
D.train_test_split
Correct Answer: train_test_split
Explanation:
train_test_split from sklearn.model_selection is the standard function to shuffle and split arrays into training and testing subsets.
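The canonical usage, sketched with placeholder arrays:

from sklearn.model_selection import train_test_split
import numpy as np

X = np.arange(20).reshape(10, 2)   # 10 samples, 2 features
y = np.arange(10) % 2

# 25% held out for testing; random_state fixes the shuffle for reproducibility
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42)
print(X_train.shape, X_test.shape)  # (7, 2) (3, 2)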
46. What happens to the decision boundary of a Logistic Regression model if the regularization parameter C is very small?
A.The boundary becomes non-linear.
B.The model underfits (high bias).
C.The model overfits.
D.The coefficients become large.
Correct Answer: The model underfits (high bias).
Explanation:
A small C implies strong regularization, forcing weights to be small. This restricts model complexity, potentially leading to underfitting.
47. Which of the following algorithms does NOT produce a linear decision boundary (without kernels)?
A.k-Nearest Neighbors
B.Linear Perceptron
C.Logistic Regression
D.Linear SVM
Correct Answer: k-Nearest Neighbors
Explanation:
k-NN produces complex, non-linear decision boundaries based on local neighborhoods, unlike the other linear methods.
48. In SVM, which kernel is defined as K(x, y) = (x · y + c)^d?
A.Linear Kernel
B.Polynomial Kernel
C.RBF Kernel
D.Sigmoid Kernel
Correct Answer: Polynomial Kernel
Explanation:
This is the mathematical formulation for a Polynomial kernel of degree d.
49. What is the primary advantage of Naïve Bayes classifiers regarding training time?
A.They depend on the number of support vectors.
B.They are fast because they require a single pass over the data.
C.They are very slow due to iterative optimization.
D.They are slow because they calculate distances between all points.
Correct Answer: They are fast because they require a single pass over the data.
Explanation:
Naïve Bayes is computationally efficient because it only requires calculating prior probabilities and conditional feature probabilities, which can be done in one pass.