1. Which type of machine learning uses labeled data, where each data point is tagged with a correct output, to train a model?
Supervised, unsupervised, and reinforcement learning: concepts and real-world use
Easy
A.Unsupervised Learning
B.Semi-supervised Learning
C.Supervised Learning
D.Reinforcement Learning
Correct Answer: Supervised Learning
Explanation:
Supervised learning is characterized by its use of labeled datasets. The algorithm learns from input-output pairs to make predictions on new, unseen data.
2. What measure of central tendency represents the most frequently occurring value in a dataset?
Statistics
Easy
A.Mean
B.Median
C.Mode
D.Range
Correct Answer: Mode
Explanation:
The mode is the value that appears most often in a set of data. The mean is the average, and the median is the middle value when the data is sorted.
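As a quick illustration of the three measures named above, here is a minimal sketch using Python's standard `statistics` module. The `scores` list is a made-up sample, not data from the quiz.

```python
from statistics import mean, median, mode

# Hypothetical sample, chosen so that the mean differs from the median and mode.
scores = [2, 3, 3, 5, 7]

average = mean(scores)      # sum / count = 20 / 5
middle = median(scores)     # middle value of the sorted list
most_common = mode(scores)  # value that appears most often

print(average, middle, most_common)
```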
3. If you roll a single fair six-sided die, what is the probability of rolling an even number (2, 4, or 6)?
Probability
Easy
A.2/3
B.1/3
C.1/2
D.1/6
Correct Answer: 1/2
Explanation:
A six-sided die has three even numbers (2, 4, 6) out of six possible outcomes. Therefore, the probability is 3/6, which simplifies to 1/2.
4. In machine learning, a collection of numbers representing a single data point (e.g., height, weight, and age of a person) is typically stored as a:
Linear algebra (applied focus)
Easy
A.Matrix
B.Scalar
C.Tensor
D.Vector
Correct Answer: Vector
Explanation:
A vector is a 1-dimensional array of numbers, which is a convenient way to represent the features of a single observation or data point.
5. What is the primary purpose of cross-validation in model evaluation?
Feature engineering and model evaluation (cross-validation, precision, recall)
Easy
A.To make the model train faster
B.To simplify the features of the data
C.To always increase the model's accuracy on the training data
D.To assess how a model will generalize to an independent, unseen dataset
Correct Answer: To assess how a model will generalize to an independent, unseen dataset
Explanation:
Cross-validation is a technique for evaluating a model's performance by training it on subsets of the data and testing it on the complementary subset, giving a more robust estimate of its performance on new data.
6. Grouping customers into different segments based on their purchasing habits, without any predefined labels for the groups, is a classic example of:
Supervised, unsupervised, and reinforcement learning: concepts and real-world use
Easy
A.Reinforcement Learning
B.Supervised Learning
C.Regression
D.Unsupervised Learning
Correct Answer: Unsupervised Learning
Explanation:
This task is known as clustering, where the goal is to find hidden patterns or structures in unlabeled data. Since there are no predefined labels for the customer groups, it is an unsupervised learning problem.
7. Bayes' Theorem is used to update the probability of a hypothesis based on new evidence. What is the term for this updated probability?
Bayes theorem
Easy
A.Likelihood
B.Posterior probability
C.Marginal probability
D.Prior probability
Correct Answer: Posterior probability
Explanation:
The posterior probability, often written as P(H | E), is the revised probability of a hypothesis H after observing evidence E. It's the result of applying Bayes' theorem.
8. A 2-dimensional grid of numbers, often used to represent an entire dataset where rows are data points and columns are features, is called a:
Linear algebra (applied focus)
Easy
A.Vector
B.Matrix
C.Diagonal
D.Scalar
Correct Answer: Matrix
Explanation:
A matrix is a rectangular array of numbers arranged in rows and columns. It's the standard way to represent a tabular dataset in machine learning.
9. In the context of a binary classification model, what does 'Precision' measure?
Feature engineering and model evaluation (cross-validation, precision, recall)
Easy
A.The model's ability to correctly identify negative instances
B.The proportion of actual positive instances that were correctly identified
C.The overall accuracy of the model across all classes
D.The proportion of positive predictions that were actually correct
Correct Answer: The proportion of positive predictions that were actually correct
Explanation:
Precision asks the question: 'Of all the instances the model predicted as positive, how many were actually positive?'. It is calculated as TP / (TP + FP), where TP is True Positives and FP is False Positives.
10. A self-driving car AI learning to navigate by receiving a 'reward' for a correct action and a 'penalty' for a mistake is using which type of machine learning?
Supervised, unsupervised, and reinforcement learning: concepts and real-world use
Easy
A.Reinforcement Learning
B.Supervised Learning
C.Unsupervised Learning
D.Dimensionality Reduction
Correct Answer: Reinforcement Learning
Explanation:
Reinforcement learning involves an 'agent' learning to make decisions by performing actions in an environment to maximize a cumulative reward. This trial-and-error approach with rewards and penalties is the core of RL.
11. What is the 'median' of the following set of numbers: [1, 7, 3, 9, 5]?
Statistics
Easy
A.7
B.3
C.5
D.25
Correct Answer: 5
Explanation:
To find the median, you must first sort the numbers: [1, 3, 5, 7, 9]. The median is the middle value in the sorted list, which is 5.
12. What does a node in a Bayesian Network typically represent?
Bayesian networks, and probabilistic reasoning
Easy
A.A random variable
B.An entire dataset
C.A machine learning algorithm
D.A deterministic value
Correct Answer: A random variable
Explanation:
A Bayesian Network is a probabilistic graphical model. Each node (or vertex) in the graph represents a random variable, and the edges represent conditional dependencies between them.
13. The probability of any event is always a number between:
Probability
Easy
A.1 and infinity
B.-1 and 1 (inclusive)
C.0 and 1 (inclusive)
D.0 and 100 (inclusive)
Correct Answer: 0 and 1 (inclusive)
Explanation:
Probability is a measure of the likelihood of an event occurring. A probability of 0 means the event is impossible, and a probability of 1 means the event is certain. It cannot be negative or greater than 1.
14. The process of selecting, transforming, or creating the most suitable input variables for a machine learning model is called:
Feature engineering and model evaluation (cross-validation, precision, recall)
Easy
A.Feature engineering
B.Algorithm selection
C.Model evaluation
D.Cross-validation
Correct Answer: Feature engineering
Explanation:
Feature engineering is a crucial pre-modeling step that uses domain knowledge to create features that make machine learning algorithms work better. It directly impacts model performance.
15. Predicting the exact price of a house based on its size, location, and number of bedrooms is an example of what kind of problem?
Supervised, unsupervised, and reinforcement learning: concepts and real-world use
Easy
A.Association
B.Classification
C.Clustering
D.Regression
Correct Answer: Regression
Explanation:
Regression is a type of supervised learning task where the goal is to predict a continuous numerical value (like a price), rather than a discrete category.
16. What is a scalar in the context of linear algebra?
Linear algebra (applied focus)
Easy
A.A type of model
B.A single number
C.A grid of numbers
D.An array of numbers
Correct Answer: A single number
Explanation:
A scalar is simply an ordinary number (an integer or a real number). It is used to scale vectors and matrices through multiplication.
17. Email spam detection, where an algorithm is trained on emails already labeled as 'spam' or 'not spam', is a task known as:
Supervised, unsupervised, and reinforcement learning: concepts and real-world use
Easy
A.Clustering
B.Regression
C.Reinforcement Learning
D.Classification
Correct Answer: Classification
Explanation:
Classification is a supervised learning task where the model learns to assign a category or class label (e.g., 'spam' or 'not spam') to a new observation.
18. In a dataset of student test scores, the difference between the highest and lowest score is known as the:
Statistics
Easy
A.Mean
B.Range
C.Variance
D.Standard Deviation
Correct Answer: Range
Explanation:
The range is the simplest measure of statistical dispersion or variability. It is calculated by subtracting the minimum value from the maximum value in a dataset.
19. In a Bayesian Network, what does a directed edge (an arrow) from Node A to Node B signify?
Bayesian networks, and probabilistic reasoning
Easy
A.Node A and Node B are completely independent
B.Node A and Node B have the same probability distribution
C.The state of Node B directly influences the probability of the state of Node A
D.The state of Node A directly influences the probability of the state of Node B
Correct Answer: The state of Node A directly influences the probability of the state of Node B
Explanation:
The directed edges in a Bayesian Network represent conditional dependencies. An arrow from A to B means that the probability of B is dependent on the value of A. A is considered a 'parent' of B.
20. A 'False Positive' in a medical test designed to detect a disease means:
Feature engineering and model evaluation (cross-validation, precision, recall)
Easy
A.The test incorrectly indicates a healthy person has the disease
B.The test incorrectly indicates a sick person is healthy
C.The test correctly indicates a healthy person is healthy
D.The test correctly indicates a sick person has the disease
Correct Answer: The test incorrectly indicates a healthy person has the disease
Explanation:
A False Positive (also called a Type I error) is an outcome where the model incorrectly predicts the positive class. In this case, 'having the disease' is the positive class, so the test falsely identifies a healthy person as being sick.
21. A model designed to detect a rare but critical disease has a high precision of 95% but a very low recall of 10%. What is the most accurate interpretation of this result?
Feature engineering and model evaluation (cross-validation, precision, recall)
Medium
A.The model has a high overall accuracy and is performing well.
B.The model is highly reliable when it predicts a patient has the disease, but it misses most of the actual positive cases.
C.The model incorrectly flags many healthy patients as having the disease.
D.The model correctly identifies most of the patients who have the disease.
Correct Answer: The model is highly reliable when it predicts a patient has the disease, but it misses most of the actual positive cases.
Explanation:
High precision (True Positives / (True Positives + False Positives)) means that when the model predicts a positive case, it is very likely to be correct. However, low recall (True Positives / (True Positives + False Negatives)) means the model fails to identify a large number of actual positive cases (it has many false negatives). This is a critical issue in medical diagnosis where failing to detect a disease is often more dangerous than a false alarm.
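To make the two formulas above concrete, the following sketch computes precision and recall from confusion-matrix counts. The counts are hypothetical, picked only so that they reproduce the 95% precision and 10% recall described in the question.

```python
# Hypothetical confusion-matrix counts (not from the quiz).
true_positives = 19    # sick patients the model flagged
false_positives = 1    # healthy patients the model flagged
false_negatives = 171  # sick patients the model missed

precision = true_positives / (true_positives + false_positives)
recall = true_positives / (true_positives + false_negatives)

print(f"precision={precision:.2f}, recall={recall:.2f}")
```

With these counts, almost every positive prediction is right, yet 171 of the 190 sick patients go undetected.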
22. An e-commerce company wants to group its customers into distinct segments based on purchasing behavior (e.g., frequency, items bought, total spending) without any predefined labels for these segments. Which type of machine learning is most suitable for this task?
Supervised, unsupervised, and reinforcement learning: concepts and real-world use
Medium
A.Semi-Supervised Learning
B.Unsupervised Learning
C.Reinforcement Learning
D.Supervised Learning
Correct Answer: Unsupervised Learning
Explanation:
This is a classic clustering problem. Since there are no predefined labels for the customer segments, the goal is to discover inherent structures or patterns in the data. Unsupervised learning algorithms, such as K-Means or DBSCAN, are designed for this purpose.
23. In Principal Component Analysis (PCA), the eigenvectors of the data's covariance matrix represent the...
Linear algebra (applied focus)
Medium
A.average value of each feature.
B.variance of each principal component.
C.number of clusters in the data.
D.directions of maximum variance in the data.
Correct Answer: directions of maximum variance in the data.
Explanation:
PCA works by finding a new set of orthogonal axes, called principal components, that align with the directions of maximum variance in the dataset. These directions are defined by the eigenvectors of the covariance matrix. The first principal component corresponds to the eigenvector with the largest eigenvalue, capturing the most variance.
24. A medical test for a disease has a 99% accuracy rate (it's correct 99% of the time). The disease has a prevalence of 1 in 10,000 people. If a randomly selected person tests positive, what can you conclude about the probability that they actually have the disease?
Bayes theorem, Bayesian networks, and probabilistic reasoning
Medium
A.It is impossible to determine without knowing the false negative rate.
B.The probability is very high, but slightly less than 99%.
C.The probability is 99%.
D.The probability is actually quite low (much less than 50%).
Correct Answer: The probability is actually quite low (much less than 50%).
Explanation:
This is a classic example of the base rate fallacy, resolved using Bayes' theorem. Because the disease is so rare, the false positives from the large healthy population (about 1% of the 9,999 healthy people in every 10,000, or roughly 100 people) far outnumber the true positives from the small sick population (the 1 sick person in every 10,000, who will most likely test positive). A positive result is therefore far more likely to be a false positive than a true positive: assuming the test is 99% accurate for both sick and healthy people, the probability of disease given a positive test is only about 1%.
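The base-rate argument can be checked numerically with Bayes' theorem. This sketch assumes that "99% accuracy" means both the sensitivity and the specificity are 0.99; the question does not state this explicitly, so treat those two values as assumptions.

```python
prevalence = 1 / 10_000  # P(disease), from the question
sensitivity = 0.99       # assumed P(positive | disease)
specificity = 0.99       # assumed P(negative | no disease)

false_positive_rate = 1 - specificity

# Total probability of a positive test (sick and healthy cases combined).
p_positive = sensitivity * prevalence + false_positive_rate * (1 - prevalence)

# Bayes' theorem: P(disease | positive).
posterior = sensitivity * prevalence / p_positive

print(f"P(disease | positive) = {posterior:.4f}")
```

Under these assumptions the posterior comes out at roughly 1%, far below 50%.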
25. A dataset of employee salaries at a tech company is heavily right-skewed due to a few extremely high executive salaries. Which measure of central tendency would provide the most realistic representation of a 'typical' employee's salary?
Statistics
Medium
A.Standard Deviation
B.Mean
C.Median
D.Mode
Correct Answer: Median
Explanation:
The mean is highly sensitive to outliers and extreme values. In a right-skewed distribution, the mean is pulled upwards by the high values, making it an inflated representation of the central tendency. The median, which is the middle value when data is sorted, is robust to outliers and will give a much better indication of the typical salary.
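A small sketch of this effect, using an invented salary list with a single extreme value (the figures are illustrative only):

```python
from statistics import mean, median

# Hypothetical salaries: four typical employees and one executive.
salaries = [50_000, 55_000, 60_000, 65_000, 1_000_000]

mean_salary = mean(salaries)      # pulled upward by the outlier
median_salary = median(salaries)  # middle value, unaffected by the outlier

print(mean_salary, median_salary)
```

Here the mean is more than four times the median, even though four of the five salaries sit close to the median.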
26. A program learns to play chess by making moves and receiving a reward of +1 for a win, -1 for a loss, and 0 for a draw after each game. The program's goal is to maximize its cumulative reward over many games. This scenario is a prime example of:
Supervised, unsupervised, and reinforcement learning: concepts and real-world use
Medium
A.Clustering (Unsupervised Learning)
B.Reinforcement Learning
C.Classification (Supervised Learning)
D.Regression (Supervised Learning)
Correct Answer: Reinforcement Learning
Explanation:
This problem has all the key elements of reinforcement learning: an agent (the program) interacting with an environment (the chess game), taking actions (making moves), and receiving delayed rewards (the outcome of the game). The agent learns a policy (a strategy for choosing moves) to maximize its long-term reward.
27. To evaluate a model's performance and generalizability while tuning its hyperparameters, the standard practice is to split the data into three sets. What is the primary purpose of the 'validation' set?
Feature engineering and model evaluation (cross-validation, precision, recall)
Medium
A.To select the best model hyperparameters without 'leaking' information from the test set.
B.To provide a final, unbiased evaluation of the model's performance on unseen data.
C.To increase the amount of data available for training.
D.To train the final model after hyperparameters have been chosen.
Correct Answer: To select the best model hyperparameters without 'leaking' information from the test set.
Explanation:
The training set is used to fit the model. The validation set is used to evaluate different hyperparameter settings and choose the best-performing model. The test set is kept completely separate until the very end to provide a final, unbiased estimate of how the chosen model will perform on new, unseen data. Using the test set for hyperparameter tuning would lead to an overly optimistic performance estimate.
28. In natural language processing, words are often represented as high-dimensional vectors (word embeddings). The cosine similarity between two word vectors measures:
Linear algebra (applied focus)
Medium
A.The difference in word frequency.
B.The semantic similarity or relatedness of the words.
C.The Euclidean distance between the words in the vector space.
D.The number of characters the words have in common.
Correct Answer: The semantic similarity or relatedness of the words.
Explanation:
Cosine similarity measures the cosine of the angle between two vectors. In the context of word embeddings, vectors for semantically similar words (e.g., 'king' and 'queen') are designed to point in similar directions. A cosine similarity close to 1 indicates a small angle and thus high semantic similarity, while a value close to 0 indicates orthogonality or dissimilarity.
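The computation itself is short. The sketch below implements cosine similarity in plain Python; the example vectors are toy values, not real word embeddings.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Toy vectors: the first pair points in the same direction, the second is orthogonal.
same_direction = cosine_similarity([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])

print(same_direction, orthogonal)
```

Note that the measure ignores vector length: doubling a vector leaves its cosine similarity to any other vector unchanged.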
29. In a classification problem, the output of a logistic regression model for a given input is 0.7. What does this value represent?
Probability
Medium
A.The probability that the input belongs to the positive class.
B.The predicted class label is 0.7.
C.The accuracy of the model on this specific input.
D.The margin of separation from the decision boundary.
Correct Answer: The probability that the input belongs to the positive class.
Explanation:
Logistic regression models the probability that an input belongs to a particular class. The sigmoid function at its output squashes any real-valued number into the range [0, 1], which is interpreted as a probability. A value of 0.7 means the model estimates a 70% probability that the sample belongs to the positive class (class 1).
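A minimal sketch of the sigmoid function described above. The raw score `z` is a hypothetical linear-model output, chosen so that the resulting probability is 0.7.

```python
import math

def sigmoid(z):
    """Map any real-valued score to the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical raw score: the log-odds that correspond to a probability of 0.7.
z = math.log(0.7 / 0.3)
probability = sigmoid(z)

# A score of 0 lies on the decision boundary.
boundary = sigmoid(0.0)

print(probability, boundary)
```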
30. In a simple Bayesian Network representing a medical diagnosis, we have the structure: Disease -> Symptom. If we know a patient has the Disease, what does this tell us about the probability of them having the Symptom?
Bayes theorem, Bayesian networks, and probabilistic reasoning
Medium
A.Knowing the patient has the Disease makes the Symptom certain to occur.
B.Knowing the patient has the Disease does not change the probability of the Symptom.
C.The probability of the Symptom becomes 0.
D.The probability of the Symptom is now conditioned on the presence of the Disease, and is given by P(Symptom | Disease).
Correct Answer: The probability of the Symptom is now conditioned on the presence of the Disease, and is given by P(Symptom | Disease).
Explanation:
The arrow from Disease to Symptom indicates a direct causal or influential relationship, where the state of Disease affects the probability of Symptom. The network encodes the conditional probability P(Symptom | Disease). Observing that the patient has the disease means we use this conditional probability table to update our belief about the likelihood of the symptom.
31. What is the primary motivation for standardizing features (e.g., using Z-score normalization) before applying distance-based algorithms like K-Nearest Neighbors (KNN)?
Statistics
Medium
A.To make the data conform to a normal distribution.
B.To reduce the number of features in the dataset.
C.To prevent features with larger scales from dominating the distance calculations.
D.To convert all features into a [0, 1] range.
Correct Answer: To prevent features with larger scales from dominating the distance calculations.
Explanation:
KNN calculates distances (like Euclidean distance) between data points. If one feature has a much larger scale than others (e.g., salary in dollars vs. years of experience), its contribution to the distance calculation will overwhelm the others. Standardization rescales features to have a mean of 0 and a standard deviation of 1, ensuring that all features contribute more equally to the distance metric.
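The rescaling described above can be sketched in a few lines. The two feature lists are invented to mirror the salary versus years-of-experience example, and the population standard deviation is used here as one reasonable choice.

```python
from statistics import mean, pstdev

def standardize(values):
    """Z-score each value: subtract the mean, divide by the population std."""
    mu = mean(values)
    sigma = pstdev(values)
    return [(v - mu) / sigma for v in values]

# Hypothetical features on very different scales.
salaries = [40_000, 50_000, 60_000, 70_000, 80_000]
years_experience = [1, 3, 5, 7, 9]

z_salaries = standardize(salaries)
z_years = standardize(years_experience)

print(z_salaries)
print(z_years)
```

After standardization both features span the same range, so neither dominates a Euclidean distance.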
32. If you multiply a 2D vector by the matrix $\begin{bmatrix} 0 & -1 \\ 1 & 0 \end{bmatrix}$, what geometric transformation is applied to the vector?
Linear algebra (applied focus)
Medium
A.A projection onto the x-axis.
B.A 90-degree counter-clockwise rotation.
C.A scaling by a factor of 2.
D.A reflection across the y-axis.
Correct Answer: A 90-degree counter-clockwise rotation.
Explanation:
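Let's apply the transformation to a basis vector, e.g., $\begin{bmatrix} 1 \\ 0 \end{bmatrix}$. The result is $\begin{bmatrix} 0 \\ 1 \end{bmatrix}$. The vector on the x-axis is rotated to the y-axis. Applying it to $\begin{bmatrix} 0 \\ 1 \end{bmatrix}$ gives $\begin{bmatrix} -1 \\ 0 \end{bmatrix}$. This consistent behavior demonstrates a 90-degree counter-clockwise rotation around the origin.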
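The basis-vector check can be reproduced in code. This sketch assumes the matrix in question is the standard 90-degree counter-clockwise rotation matrix with rows (0, -1) and (1, 0), which is what the stated answer implies.

```python
def apply(matrix, vector):
    """Multiply a 2x2 matrix (list of rows) by a 2D vector."""
    (a, b), (c, d) = matrix
    x, y = vector
    return (a * x + b * y, c * x + d * y)

# Assumed rotation matrix: 90 degrees counter-clockwise.
rotation = [[0, -1],
            [1, 0]]

image_of_x_axis = apply(rotation, (1, 0))  # unit vector along x
image_of_y_axis = apply(rotation, (0, 1))  # unit vector along y

print(image_of_x_axis, image_of_y_axis)
```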
33. You perform 5-fold cross-validation to evaluate your machine learning model. If your dataset has 1000 instances, how many instances are in the training set for each of the 5 iterations?
Feature engineering and model evaluation (cross-validation, precision, recall)
Medium
A.500
B.1000
C.800
D.200
Correct Answer: 800
Explanation:
In k-fold cross-validation, the dataset is divided into k equal (or nearly equal) folds. For each iteration, one fold is used as the validation set, and the remaining k-1 folds are used as the training set. With 1000 instances and k=5, each fold has 1000/5 = 200 instances. Therefore, the training set in each iteration will consist of 4 folds, which is 4 * 200 = 800 instances.
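The fold arithmetic above can be sketched as a small index splitter. The helper `k_fold_indices` is a hypothetical name written for this illustration, and it assumes the dataset size divides evenly by k and that no shuffling is needed.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, validation_indices) for each of k contiguous folds.

    Simplifying assumption: n_samples is divisible by k.
    """
    fold_size = n_samples // k
    for fold in range(k):
        start, stop = fold * fold_size, (fold + 1) * fold_size
        validation = list(range(start, stop))
        train = [i for i in range(n_samples) if i < start or i >= stop]
        yield train, validation

# Sizes of the training and validation sets in each of the 5 iterations.
sizes = [(len(train), len(val)) for train, val in k_fold_indices(1000, 5)]
print(sizes)
```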
34. Which of the following problems is best framed as a regression task rather than a classification task?
Supervised, unsupervised, and reinforcement learning: concepts and real-world use
Medium
A.Predicting the price of a house based on its features.
B.Determining if an email is spam or not spam.
C.Predicting whether a customer will churn (yes/no).
D.Identifying the species of a flower from a photo.
Correct Answer: Predicting the price of a house based on its features.
Explanation:
Classification tasks involve predicting a discrete, categorical label (e.g., yes/no, spam/not spam, flower species). Regression tasks involve predicting a continuous numerical value. Predicting a house price, which can be any value within a range, is a quintessential regression problem.
35. Two events A and B are mutually exclusive. If P(A) = 0.4 and P(B) = 0.3, what is the probability of A or B occurring, i.e., P(A ∪ B)?
Probability
Medium
A.0.12
B.1.0
C.0.7
D.0.1
Correct Answer: 0.7
Explanation:
The general formula for the union of two events is P(A ∪ B) = P(A) + P(B) - P(A ∩ B). For mutually exclusive events, they cannot occur at the same time, so their joint probability P(A ∩ B) is 0. Therefore, the formula simplifies to P(A ∪ B) = P(A) + P(B) = 0.4 + 0.3 = 0.7.
36. You are building a spam filter. Given that P(Spam) = 0.2, P(contains 'free' | Spam) = 0.8, and P(contains 'free' | Not Spam) = 0.05. Using Bayes' theorem, what are you trying to calculate?
Bayes theorem, Bayesian networks, and probabilistic reasoning
Medium
A.P(contains 'free')
B.P(Spam, contains 'free')
C.P(Not Spam)
D.P(Spam | contains 'free')
Correct Answer: P(Spam | contains 'free')
Explanation:
The goal of a spam filter is to determine the probability that an email is spam given that we have observed some evidence (like the word 'free'). This is a conditional probability, specifically the posterior probability P(Spam | contains 'free'). Bayes' theorem provides the framework to calculate this using the prior probability P(Spam), the likelihood P(contains 'free' | Spam), and the evidence P(contains 'free').
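Using only the three probabilities given in the question, the posterior can be worked out as follows. The variable names are chosen for this sketch.

```python
p_spam = 0.2                  # prior, P(Spam)
p_free_given_spam = 0.8       # likelihood, P(contains 'free' | Spam)
p_free_given_not_spam = 0.05  # P(contains 'free' | Not Spam)

# Evidence via the law of total probability: P(contains 'free').
p_free = p_free_given_spam * p_spam + p_free_given_not_spam * (1 - p_spam)

# Bayes' theorem: P(Spam | contains 'free').
p_spam_given_free = p_free_given_spam * p_spam / p_free

print(f"P(Spam | contains 'free') = {p_spam_given_free:.2f}")
```

Seeing the word 'free' raises the probability of spam from the prior of 0.2 to a posterior of 0.8.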
37. What is the primary risk of performing feature selection based on the model's performance on the final test set?
Feature engineering and model evaluation (cross-validation, precision, recall)
Medium
A.It is computationally too expensive.
B.It can lead to a model that is too simple (underfitting).
C.It causes information from the test set to leak into the model selection process, leading to an over-optimistic performance estimate.
D.It requires the data to be normally distributed.
Correct Answer: It causes information from the test set to leak into the model selection process, leading to an over-optimistic performance estimate.
Explanation:
The test set must be held out and used only once for a final, unbiased evaluation. If you use the test set to guide your feature selection (or any other model tuning), you are implicitly fitting your model to the test set. The resulting performance metric will be inflated because the model has already 'seen' the data in some form, and it will not generalize as well to truly new, unseen data.
38. What does it imply if the determinant of a transformation matrix used in a machine learning model is zero?
Linear algebra (applied focus)
Medium
A.The transformation scales the data uniformly.
B.The transformation is an identity operation (no change).
C.The transformation is a pure rotation.
D.The transformation collapses the data into a lower-dimensional space.
Correct Answer: The transformation collapses the data into a lower-dimensional space.
Explanation:
The determinant of a matrix represents the factor by which area (in 2D) or volume (in 3D) is scaled by the transformation. A determinant of zero means that the area/volume becomes zero. This happens when the transformation squashes the data onto a line (from 2D) or a plane (from 3D), effectively reducing its dimensionality and making the transformation non-invertible.
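As an illustration of this collapse, the sketch below applies a 2x2 matrix with determinant zero to a few points. The matrix and the points are made-up examples; every image lands on the line y = 2x.

```python
def det2(matrix):
    """Determinant of a 2x2 matrix given as a list of rows."""
    (a, b), (c, d) = matrix
    return a * d - b * c

def apply(matrix, vector):
    """Multiply a 2x2 matrix by a 2D vector."""
    (a, b), (c, d) = matrix
    x, y = vector
    return (a * x + b * y, c * x + d * y)

# Hypothetical singular matrix: the second row is twice the first.
singular = [[1, 2],
            [2, 4]]

points = [(1, 0), (0, 1), (1, 1), (3, -2)]
images = [apply(singular, p) for p in points]

print(det2(singular), images)
```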
39. In a linear regression analysis, you create a plot of residuals versus fitted values and observe a distinct funnel shape (heteroscedasticity). What key assumption of linear regression does this violate?
Statistics
Medium
A.Normality of errors.
B.Linearity of the relationship.
C.Constant variance of errors (homoscedasticity).
D.Independence of errors.
Correct Answer: Constant variance of errors (homoscedasticity).
Explanation:
One of the core assumptions of linear regression is homoscedasticity, which means the variance of the residuals (errors) should be constant across all levels of the independent variables. A funnel shape in the residual plot indicates that the error variance is not constant (it increases or decreases with the fitted values), which is known as heteroscedasticity. Under heteroscedasticity the least-squares coefficient estimates remain unbiased but are no longer efficient, and the usual standard errors and significance tests become unreliable.
40. A hospital has a large dataset of patient records, where a small fraction of the records are labeled with a correct diagnosis by expert doctors, but the vast majority are unlabeled. The goal is to build a diagnostic model using all the available data. This problem is an example of:
Supervised, unsupervised, and reinforcement learning: concepts and real-world use
Medium
A.Multi-task Learning
B.Semi-Supervised Learning
C.Active Learning
D.Transfer Learning
Correct Answer: Semi-Supervised Learning
Explanation:
Semi-supervised learning is a paradigm that falls between supervised and unsupervised learning. It is used for problems where you have a small amount of labeled data and a large amount of unlabeled data. The goal is to leverage the structure within the unlabeled data to improve the performance of a model that is initially trained on the small labeled set.
41. In the context of Principal Component Analysis (PCA), if the covariance matrix of your centered data is a non-identity diagonal matrix (i.e., diagonal entries are positive but not all equal to 1), what can you definitively conclude about the principal components?
Linear algebra (applied focus)
Hard
A.The data is perfectly correlated, and PCA will reduce its dimensionality to one.
B.The principal components will be at a 45-degree angle to the original axes, reflecting an average of the variances.
C.The principal components are aligned with the original feature axes, and the transformation is essentially a scaling, not a rotation.
D.The covariance matrix is singular, and PCA cannot be computed.
Correct Answer: The principal components are aligned with the original feature axes, and the transformation is essentially a scaling, not a rotation.
Explanation:
A diagonal covariance matrix indicates that the original features are already uncorrelated. The goal of PCA is to find a new, orthogonal basis of uncorrelated variables (the principal components). Since the original feature axes already form such a basis, the principal components will be aligned with these axes, ordered by decreasing variance. The transformation matrix will therefore be a permutation matrix or the identity matrix, meaning the data is not rotated; at most the axes are reordered, and any rescaling is determined by the variances (the diagonal entries).
42. You are developing a credit fraud detection model with a dataset where only 0.1% of transactions are fraudulent. After training, you achieve 99.9% accuracy. You then evaluate using the Area Under the Precision-Recall Curve (AUC-PR) and get a score of 0.2. What is the most accurate and nuanced interpretation of these results?
Feature engineering and model evaluation (cross-validation, precision, recall)
Hard
A.The high accuracy is a misleading metric due to extreme class imbalance, and the AUC-PR of 0.2, while appearing low, is significantly better than a random baseline and indicates the model has some, albeit imperfect, skill.
B.An AUC-PR of 0.2 is very poor for any dataset, indicating the model's predictions are no better than random guessing.
C.The model is excellent because the accuracy is nearly perfect, and the low AUC-PR score must be an error in calculation or interpretation.
D.The model is severely overfitting, as indicated by the large discrepancy between the accuracy score and the AUC-PR score.
Correct Answer: The high accuracy is a misleading metric due to extreme class imbalance, and the AUC-PR of 0.2, while appearing low, is significantly better than a random baseline and indicates the model has some, albeit imperfect, skill.
Explanation:
In a highly imbalanced dataset, a naive model that always predicts the majority class (non-fraudulent) would also achieve 99.9% accuracy, so accuracy is uninformative here. The AUC-PR is a more informative metric. The baseline for AUC-PR (the expected score of a random classifier) is approximately equal to the fraction of positives, which is 0.001. A score of 0.2 is 200 times better than that baseline. The correct option therefore recognizes both that accuracy is misleading and that the model has learned a meaningful signal, even if its performance is far from perfect.
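The accuracy trap can be demonstrated with a classifier that never predicts fraud. The dataset size below is an arbitrary choice; only the 0.1% fraud rate and the AUC-PR of 0.2 come from the question.

```python
n_total = 100_000  # hypothetical dataset size
n_fraud = 100      # 0.1% of transactions, as in the question

y_true = [1] * n_fraud + [0] * (n_total - n_fraud)
y_pred = [0] * n_total  # naive model: always predict "not fraud"

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / n_total
frauds_caught = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
recall = frauds_caught / n_fraud

# Approximate AUC-PR of a random classifier: the positive-class fraction.
pr_baseline = n_fraud / n_total
lift_over_baseline = 0.2 / pr_baseline

print(accuracy, recall, pr_baseline, lift_over_baseline)
```

The naive model matches the 99.9% accuracy while catching no fraud at all, which is why accuracy alone says little here.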
43. In the Bayesian network defined by the structure A -> B <- C, which statement about the probabilistic relationship between nodes A and C is correct?
Bayesian networks, and probabilistic reasoning
Hard
A.A and C are marginally dependent but become conditionally independent given B.
B.A and C are both marginally and conditionally dependent on each other.
C.A and C are marginally independent but become conditionally dependent given B.
D.A and C are both marginally and conditionally independent of each other.
Correct Answer: A and C are marginally independent but become conditionally dependent given B.
Explanation:
This structure is known as a 'v-structure' or a 'collider'. Node B is a collider because two arrows point into it. In a v-structure, the path between A and C is blocked by default, making them marginally independent (A ⊥ C). However, if we observe or condition on the collider B (or any of its descendants), the path becomes unblocked, and information can flow between A and C. This makes them conditionally dependent. This phenomenon is often called 'explaining away'.
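'Explaining away' can be verified exactly on a tiny example. The sketch below assumes a hypothetical network in which A and C are independent fair coin flips and B is their logical OR; these choices are illustrative and not part of the question.

```python
from fractions import Fraction
from itertools import product

# Joint distribution over (a, b, c): A and C are independent fair coins, B = A OR C.
joint = {(a, a | c, c): Fraction(1, 4) for a, c in product([0, 1], repeat=2)}

def prob(condition):
    """Total probability of all outcomes (a, b, c) satisfying the condition."""
    return sum(p for (a, b, c), p in joint.items() if condition(a, b, c))

p_a = prob(lambda a, b, c: a == 1)
p_a_given_c = prob(lambda a, b, c: a == 1 and c == 1) / prob(lambda a, b, c: c == 1)

p_a_given_b = prob(lambda a, b, c: a == 1 and b == 1) / prob(lambda a, b, c: b == 1)
p_a_given_b_and_c = (prob(lambda a, b, c: a == 1 and b == 1 and c == 1)
                     / prob(lambda a, b, c: b == 1 and c == 1))

print(p_a, p_a_given_c, p_a_given_b, p_a_given_b_and_c)
```

Learning C leaves the probability of A unchanged when B is unobserved, but once B = 1 is known, learning C = 1 lowers the probability of A from 2/3 to 1/2.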
44. A robotics company wants to train a bipedal robot to walk on varied and unseen terrain. The robot receives sensor data about its joint angles and orientation and a sparse positive reward only when it reaches a destination. It receives a large negative reward if it falls. Which specific class of algorithms within a broader ML paradigm is most appropriate for this task?
Supervised, unsupervised, and reinforcement learning: concepts and real-world use
Hard
This is a classic Reinforcement Learning problem due to the agent-environment interaction and reward-based learning. Supervised learning is not feasible as there is no labeled dataset of 'correct' movements. Unsupervised learning can help with state representation but doesn't solve the core control problem. Within RL, the dynamics of a bipedal robot on varied terrain are extremely complex and difficult to model accurately, making model-based RL challenging. The problem involves a continuous action space (joint torques/angles) and requires learning a stochastic policy, which makes model-free, policy-based methods like Proximal Policy Optimization (PPO) or Asynchronous Advantage Actor-Critic (A3C) the state-of-the-art and most suitable approach.
45When solving a linear regression problem using the normal equation, $\hat{\beta} = (X^T X)^{-1} X^T y$, you find that the matrix $X^T X$ is singular (non-invertible). What is the most likely data-related cause and the mathematical consequence?
Linear algebra (applied focus)
Hard
A.Cause: The number of samples is far greater than the number of features. Consequence: The model will underfit the data.
B.Cause: Features are not normalized. Consequence: The inverse operation is numerically unstable, but a unique solution technically exists.
C.Cause: Perfect multicollinearity among features. Consequence: The system has infinite solutions for the regression coefficients $\beta$.
D.Cause: The target variable y contains extreme outliers. Consequence: The matrix inverse cannot be computed.
Correct Answer: Cause: Perfect multicollinearity among features. Consequence: The system has infinite solutions for the regression coefficients $\beta$.
Explanation:
The matrix $X^T X$, known as the Gram matrix, is singular if and only if the columns of X (the features) are linearly dependent. This condition is called perfect multicollinearity. From a linear algebra perspective, if $X^T X$ is singular, it has no inverse, so the closed-form normal equation cannot be evaluated and does not yield a unique solution. The underlying linear system $X^T X \beta = X^T y$ is always consistent, so a singular $X^T X$ means it has infinitely many solutions. In the context of least squares, this corresponds to an infinite number of coefficient vectors that all achieve the same minimum squared error.
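A small numeric illustration of the singular case, using made-up data in which the second feature is exactly twice the first. The data and the two coefficient vectors are assumptions chosen for the example.

```python
# Hypothetical data: the second feature is exactly twice the first.
X = [[1, 2], [2, 4], [3, 6]]
y = [3, 6, 9]

def gram(X):
    """Return X^T X for a matrix stored as a list of rows."""
    cols = list(zip(*X))
    return [[sum(a * b for a, b in zip(ci, cj)) for cj in cols] for ci in cols]

def predict(X, beta):
    """Return the fitted values X @ beta."""
    return [sum(x * b for x, b in zip(row, beta)) for row in X]

G = gram(X)
det = G[0][0] * G[1][1] - G[0][1] * G[1][0]  # determinant of the 2x2 Gram matrix

# Two different coefficient vectors that reproduce y exactly.
beta_a = [3, 0]
beta_b = [1, 1]

print(det)                                      # zero determinant: not invertible
print(predict(X, beta_a), predict(X, beta_b))   # identical fits
```

Both coefficient vectors achieve zero squared error here, and so does every point on the line connecting them, which is the "infinite solutions" consequence in miniature.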
46In a K-fold cross-validation setup, what is the primary statistical trade-off when increasing the value of K from 5 to N (where N is the total number of samples), a technique also known as Leave-One-Out Cross-Validation (LOOCV)?
Feature engineering and model evaluation (cross-validation, precision, recall)
Hard
A.The computational cost decreases, but the bias of the estimate increases.
B.The bias of the performance estimate decreases, but its variance increases significantly.
C.Both the bias and variance of the performance estimate decrease, improving reliability.
D.The variance of the performance estimate decreases, but its bias increases.
Correct Answer: The bias of the performance estimate decreases, but its variance increases significantly.
Explanation:
As K increases, the training set size for each fold ($N(K-1)/K$) approaches the full dataset size N. Models trained on more data are generally better, so the performance estimate becomes less biased (it's a better estimate of the true performance of a model trained on N samples). However, with LOOCV, the N training sets are nearly identical (differing by only one sample). This high correlation between the models trained in each fold leads to a high variance in the final performance estimate. The average of highly correlated variables has a much higher variance than the average of independent variables. Thus, the trade-off is accepting higher variance for lower bias.
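The two quantities in this argument can be tabulated directly. This sketch assumes equal-sized folds; the helper names and the dataset size are hypothetical.

```python
from fractions import Fraction

def train_fraction(k):
    """Fraction of the N samples used for training in each of k equal folds."""
    return Fraction(k - 1, k)

def train_overlap(k):
    """Fraction of one fold's training set that also appears in another fold's training set."""
    return Fraction(k - 2, k - 1)

n = 100  # hypothetical dataset size; k = n corresponds to LOOCV
for k in (5, 10, n):
    print(k, float(train_fraction(k)), float(train_overlap(k)))
```

As k grows toward n, both fractions approach 1: each model sees almost all the data (lower bias), and any two training sets are almost identical (highly correlated fold results).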
47A rare disease affects 1 in 10,000 people. A test for the disease is developed with a 99% true positive rate and a 98% true negative rate. If a person tests positive, what is the probability they actually have the disease? The key challenge here is the combination of a rare event and an imperfect test.
Bayes theorem
Hard
A.Approximately 2%
B.Approximately 50%
C.Approximately 99%
D.Approximately 0.49%
Correct Answer: Approximately 0.49%
Explanation:
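This requires a careful application of Bayes' theorem. Let D be having the disease, and T be testing positive. We want $P(D \mid T)$.
We are given:
$P(D) = 1/10{,}000 = 0.0001$ (Prevalence)
$P(T \mid D) = 0.99$ (True Positive Rate)
$P(\neg T \mid \neg D) = 0.98$ (True Negative Rate), so $P(T \mid \neg D) = 0.02$ (False Positive Rate)
Using Bayes' theorem:
$P(D \mid T) = \frac{P(T \mid D)\,P(D)}{P(T \mid D)\,P(D) + P(T \mid \neg D)\,P(\neg D)} = \frac{0.99 \times 0.0001}{0.99 \times 0.0001 + 0.02 \times 0.9999} = \frac{0.000099}{0.020097} \approx 0.0049$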
This is approximately 0.49%. Even with a seemingly accurate test, the vast number of false positives from the healthy population swamps the true positives from the sick population due to the disease's rarity.
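The arithmetic can be verified with a few lines of code. The function name and parameter names below are illustrative; the numbers come from the question.

```python
def posterior(prior, tpr, tnr):
    """P(disease | positive test) via Bayes' theorem."""
    fpr = 1 - tnr                              # false positive rate
    evidence = tpr * prior + fpr * (1 - prior)  # P(positive test)
    return tpr * prior / evidence

p = posterior(prior=1 / 10_000, tpr=0.99, tnr=0.98)
print(f"{p:.2%}")
```

Changing `prior` to a more common condition (for example 0.1) shows how strongly the posterior depends on the base rate rather than on the test's accuracy alone.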
48What is the primary motivation for using an off-policy reinforcement learning algorithm like Q-Learning over an on-policy one like SARSA?
Supervised, unsupervised, and reinforcement learning: concepts and real-world use
Hard
A.It guarantees faster convergence by reducing the variance of the value function updates.
B.It allows the agent to learn about the optimal policy while behaving according to an exploratory (sub-optimal) policy.
C.It is a model-based approach, which is more sample-efficient than the model-free on-policy methods.
D.It avoids the need for a discount factor gamma, simplifying the Bellman equation.
Correct Answer: It allows the agent to learn about the optimal policy while behaving according to an exploratory (sub-optimal) policy.
Explanation:
The core strength of off-policy learning is the decoupling of the target policy (the policy we want to learn) from the behavior policy (the policy used to generate experience). Q-Learning's update rule uses the max Q-value for the next state, effectively learning about the greedy (optimal) policy, regardless of which action was actually taken by the exploratory behavior policy. SARSA, an on-policy algorithm, updates its Q-values based on the action actually taken, thus learning the value of its current behavior policy (including its exploration steps). This decoupling allows off-policy methods to learn from historical data or from a human expert's actions, which is not possible with on-policy methods.
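The difference between the two update rules comes down to the bootstrap target. The following is a minimal sketch with a hypothetical Q-table and made-up values, not a full agent.

```python
def q_learning_target(q, reward, next_state, gamma):
    """Off-policy target: bootstrap from the greedy action in next_state."""
    return reward + gamma * max(q[next_state].values())

def sarsa_target(q, reward, next_state, next_action, gamma):
    """On-policy target: bootstrap from the action the behavior policy actually took."""
    return reward + gamma * q[next_state][next_action]

# Hypothetical Q-table: state -> {action: value}.
q = {"s1": {"left": 1.0, "right": 5.0}}

# Suppose the exploratory policy picked the worse action "left" in s1.
ql = q_learning_target(q, reward=0.0, next_state="s1", gamma=0.9)
sa = sarsa_target(q, reward=0.0, next_state="s1", next_action="left", gamma=0.9)
print(ql, sa)
```

Q-Learning's target ignores the exploratory choice and uses the best available value, while SARSA's target reflects the action that was actually taken.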
49A dataset has two binary features, and , and a binary class . You observe that and . If you also know that and are conditionally independent given , can be lower than 0.6? Why or why not?
Probability
Hard
A.No, the conditional independence assumption implies that , which would be .
B.No, because the two features independently provide evidence for Y=1, their combined evidence must be stronger than either alone.
C.Yes, but only if the features and are negatively correlated.
D.Yes, this can happen if and are both very high, causing the evidence for to accumulate faster.
Correct Answer: Yes, this can happen if and are both very high, causing the evidence for to accumulate faster.
Explanation:
This is a non-intuitive result related to the base rates and evidence accumulation (Simpson's Paradox can be related). The probability depends on the likelihoods of seeing and for both classes ( and ). While both features individually increase the probability of , it is possible that the combination is extremely common when . If and are both very high (e.g., 0.9), their product under the independence assumption makes the likelihood very high. This strong evidence for can potentially outweigh the evidence for , pushing the posterior probability down, possibly even below 0.6.
50You are comparing two complex models (e.g., a deep neural network and a gradient boosting machine) using 5x2 cross-validation and a paired t-test on the results to claim statistical significance. According to Dietterich (1998), why might this statistical test have a higher than desired Type I error rate in this specific machine learning context?
Statistics
Hard
A.The training sets in cross-validation are highly overlapping, which violates the independence assumption of the t-test, leading to an underestimation of the true variance.
B.A paired t-test cannot be used to compare two different algorithms; it can only compare one algorithm with different hyperparameters.
C.The 5x2 CV procedure systematically biases the performance estimate in favor of the more complex model.
D.The underlying distribution of model accuracies is often not Gaussian, which is a core assumption of the t-test.
Correct Answer: The training sets in cross-validation are highly overlapping, which violates the independence assumption of the t-test, leading to an underestimation of the true variance.
Explanation:
The standard t-test assumes that the measurements (the performance differences per fold) are independent. However, in k-fold cross-validation, the training sets for any two folds overlap by a large amount. This means the models produced are not independent, and their performance scores are correlated. This correlation leads to an underestimation of the variance of the average difference. A smaller estimated variance makes the t-statistic larger, increasing the likelihood of rejecting the null hypothesis when it is true (a Type I error). Dietterich's seminal paper showed that this issue is particularly pronounced for standard k-fold CV and proposed the 5x2 CV t-test as a partial remedy, though the core issue of dependence remains a concern in ML model comparison.
51You apply both K-Means and a Gaussian Mixture Model (GMM) with 3 components to a dataset. You find that K-Means identifies three well-separated, spherical clusters, while the GMM identifies three overlapping, elliptical clusters. Which statement is the most valid conclusion?
Supervised, unsupervised, and reinforcement learning: concepts and real-world use
Hard
A.K-Means is the correct choice because it produced well-separated clusters, indicating the GMM is overfitting the data by creating complex shapes.
B.GMM provides a more nuanced result by modeling cluster covariance and providing probabilistic assignments, which is likely superior if the true clusters are not perfectly spherical and separated.
C.Both algorithms have failed; K-Means is too simplistic and GMM is too complex, so a density-based algorithm like DBSCAN should be used.
D.The results are equivalent, as GMM is just a probabilistic version of K-Means and will always converge to a similar result.
Correct Answer: GMM provides a more nuanced result by modeling cluster covariance and providing probabilistic assignments, which is likely superior if the true clusters are not perfectly spherical and separated.
Explanation:
K-Means is a hard-assignment algorithm that assumes clusters are isotropic (spherical) and of similar size. It can only find linear decision boundaries between clusters. GMM is a soft-assignment algorithm that generalizes K-Means. It can model non-spherical (elliptical) clusters by estimating a full covariance matrix for each component. The fact that GMM found overlapping, elliptical clusters suggests this is a more accurate representation of the underlying data structure than the simplistic, spherical assumption of K-Means. GMM's probabilistic assignments also provide a measure of uncertainty for points in the overlapping regions.
52You are building a model to predict house prices and have a 'zip_code' categorical feature with over 1000 unique values. Why is one-hot encoding this feature for a linear regression model often a poor choice, and what is a more effective (though potentially risky) alternative?
Feature engineering and model evaluation (cross-validation, precision, recall)
Hard
A.Poor choice: It introduces perfect multicollinearity into the feature matrix. Alternative: Using ordinal encoding by sorting zip codes numerically.
B.Poor choice: It creates a very high-dimensional, sparse feature space (curse of dimensionality) which can hurt model performance and interpretability. Alternative: Target encoding, where each zip code is replaced by the average house price within that zip code.
C.Poor choice: It violates the independence assumption of linear regression. Alternative: Hashing the feature into a smaller number of dimensions.
D.Poor choice: It cannot be used in a linear model, only in tree-based models. Alternative: Deleting the feature from the dataset entirely.
Correct Answer: Poor choice: It creates a very high-dimensional, sparse feature space (curse of dimensionality) which can hurt model performance and interpretability. Alternative: Target encoding, where each zip code is replaced by the average house price within that zip code.
Explanation:
One-hot encoding a high-cardinality categorical feature like 'zip_code' leads to the curse of dimensionality, creating thousands of new binary features. This sparsity makes it difficult for a linear model to learn robust weights, especially for zip codes with few samples. A powerful alternative is target encoding (or mean encoding). This replaces the categorical feature with a single numerical feature representing the average target value for that category. This directly encodes information about the target variable, making it highly predictive. The risk is data leakage and overfitting if not implemented carefully (e.g., by calculating the means on the training set only and applying them to the validation/test sets, or by using a cross-validation scheme).
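One way to implement the "training set only" safeguard is sketched below. The function names, the fallback to the global mean for unseen zip codes, and the toy prices are assumptions for illustration.

```python
from collections import defaultdict

def fit_target_encoding(categories, targets):
    """Learn per-category target means and a global fallback from training rows only."""
    sums, counts = defaultdict(float), defaultdict(int)
    for c, t in zip(categories, targets):
        sums[c] += t
        counts[c] += 1
    means = {c: sums[c] / counts[c] for c in sums}
    global_mean = sum(targets) / len(targets)
    return means, global_mean

def transform(categories, means, global_mean):
    """Replace each category with its training mean; unseen categories get the global mean."""
    return [means.get(c, global_mean) for c in categories]

# Hypothetical toy data (prices in thousands).
train_zip = ["10001", "10001", "94105", "94105", "60601"]
train_price = [500.0, 700.0, 900.0, 1100.0, 300.0]

means, global_mean = fit_target_encoding(train_zip, train_price)
test_encoded = transform(["94105", "73301"], means, global_mean)
print(test_encoded)
```

Because the means are fitted on training rows and only looked up for test rows, no test-set prices leak into the feature. In practice, smoothing toward the global mean is often added for categories with few samples.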
53You are modeling a system with a Naive Bayes classifier. The 'naive' assumption is that all features are conditionally independent given the class, i.e., $P(X_1, \dots, X_n \mid Y) = \prod_{i} P(X_i \mid Y)$. If two features, $X_1$ and $X_2$, are actually perfectly correlated, how does this violation of the assumption affect the posterior probability calculation for class Y?
Bayesian networks, and probabilistic reasoning
Hard
A.It will cause the model to ignore one of the features, as the information is redundant, leading to a loss of signal.
B.The violation has no effect on the rank-ordering of the posterior probabilities, so the final classification decision remains optimal.
C.It will 'double-count' the evidence from the correlated features, leading to posterior probabilities that are unjustifiably extreme (pushed towards 0 or 1).
D.The model's posterior probability calculation will fail due to a division by zero error caused by the linear dependency.
Correct Answer: It will 'double-count' the evidence from the correlated features, leading to posterior probabilities that are unjustifiably extreme (pushed towards 0 or 1).
Explanation:
The Naive Bayes classifier calculates the posterior as $P(Y \mid X_1, \dots, X_n) \propto P(Y) \prod_{i} P(X_i \mid Y)$. If $X_1$ and $X_2$ provide the same information (perfect correlation), including both $P(X_1 \mid Y)$ and $P(X_2 \mid Y)$ in the product is like including the same piece of evidence twice. This squaring effect on the likelihood term will artificially amplify the evidence, pushing the calculated posterior probabilities towards 0 or 1 and making the model overly confident in its predictions.
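The double-counting effect is easy to see numerically. The sketch below uses made-up likelihoods and a uniform prior, and compares the posterior computed with one feature against the posterior when an exact copy of that feature is added.

```python
def nb_posterior(prior_pos, likelihoods_pos, likelihoods_neg):
    """P(Y=1 | features) under the naive conditional-independence assumption."""
    score_pos, score_neg = prior_pos, 1 - prior_pos
    for lp, ln in zip(likelihoods_pos, likelihoods_neg):
        score_pos *= lp
        score_neg *= ln
    return score_pos / (score_pos + score_neg)

# Hypothetical numbers: P(X=1 | Y=1) = 0.8, P(X=1 | Y=0) = 0.4, P(Y=1) = 0.5.
once = nb_posterior(0.5, [0.8], [0.4])
twice = nb_posterior(0.5, [0.8, 0.8], [0.4, 0.4])  # the same feature counted twice
print(once, twice)
```

The duplicated feature carries no new information, so the correct posterior is still the single-feature value; the naive product nevertheless moves it closer to 1.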
54In a recommender system using truncated Singular Value Decomposition (SVD) on the user-item matrix $R \approx U_k \Sigma_k V_k^T$, what is the geometric interpretation of predicting a user's rating for an unseen item?
Linear algebra (applied focus)
Hard
A.It is calculated by projecting the user's vector onto the principal components of the item-item covariance matrix.
B.It is the cosine similarity between the user's latent factor vector (from $U_k$) and the item's latent factor vector (from $V_k$).
C.It involves finding the Euclidean distance between the user's vector and the item's vector in the original high-dimensional space.
D.It is equivalent to taking the dot product of the user's vector in the k-dimensional latent space (a row in $U_k \Sigma_k$) and the item's vector in the same space (a row in $V_k$).
Correct Answer: It is equivalent to taking the dot product of the user's vector in the k-dimensional latent space (a row in $U_k \Sigma_k$) and the item's vector in the same space (a row in $V_k$).
Explanation:
The reconstructed matrix $\hat{R} = U_k \Sigma_k V_k^T$ provides the predictions. An individual entry $\hat{R}_{ij}$ (rating of user i for item j) is the dot product of the i-th row of $U_k \Sigma_k$ and the j-th row of $V_k$ (which is the j-th column of $V_k^T$). Geometrically, this means we represent both users and items as vectors in a common k-dimensional latent space. A high rating is predicted when the user and item vectors are closely aligned and have large magnitudes in this space, as captured by the dot product.
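A minimal sketch of the prediction step, assuming the latent vectors have already been obtained from a truncated SVD. The k = 2 vectors below are invented for illustration and do not come from a real factorization.

```python
def predict_rating(user_vec, item_vec):
    """Predicted rating as the dot product of two vectors in the shared latent space."""
    return sum(u * v for u, v in zip(user_vec, item_vec))

# Hypothetical k = 2 latent vectors.
user = [2.0, 0.5]           # user's latent vector (singular values already folded in)
item_aligned = [1.5, 0.2]   # points in roughly the same direction as the user
item_opposed = [-1.0, 0.4]  # points mostly away from the user

print(predict_rating(user, item_aligned))
print(predict_rating(user, item_opposed))
```

Unlike cosine similarity, the dot product also depends on the vectors' magnitudes, which is why alignment and length both raise the predicted rating.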
55You have trained a Support Vector Machine (SVM) with a Radial Basis Function (RBF) kernel. You observe that the model has very high accuracy on the training set but poor accuracy on the test set. Which hyperparameter adjustments are most likely to mitigate this overfitting?
Supervised, unsupervised, and reinforcement learning: concepts and real-world use
Hard
A.Decrease the regularization parameter C and increase the kernel coefficient gamma.
B.Increase the regularization parameter C and decrease the kernel coefficient gamma.
C.Decrease the regularization parameter C and decrease the kernel coefficient gamma.
D.Increase the regularization parameter C and increase the kernel coefficient gamma.
Correct Answer: Decrease the regularization parameter C and decrease the kernel coefficient gamma.
Explanation:
Overfitting in an RBF SVM indicates the decision boundary is too complex and sensitive to individual data points.
The C parameter is the regularization parameter. A high C value penalizes misclassifications heavily, leading to a complex, tight-fitting boundary. Decreasing C allows for a 'softer' margin, tolerating more misclassifications in the training set to achieve a simpler, more generalizable decision boundary.
The gamma parameter defines the influence of a single training example. A high gamma leads to a very localized influence, creating a complex, 'spiky' boundary. Decreasing gamma makes the influence of each support vector broader, resulting in a smoother, less complex boundary. Both actions serve to regularize the model and combat overfitting.
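The effect of gamma can be seen directly in the RBF kernel, which is commonly written as $k(x, z) = \exp(-\gamma \lVert x - z \rVert^2)$. The points and gamma values below are arbitrary choices for illustration.

```python
import math

def rbf_kernel(x, z, gamma):
    """RBF kernel value exp(-gamma * ||x - z||^2)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-gamma * sq_dist)

x, z = [0.0, 0.0], [1.0, 1.0]        # squared distance between the points is 2
low = rbf_kernel(x, z, gamma=0.1)    # broad influence: the points still look similar
high = rbf_kernel(x, z, gamma=10.0)  # very local influence: similarity is near zero
print(low, high)
```

With a large gamma, each support vector only affects its immediate neighborhood, which is what allows the boundary to wrap tightly around individual training points.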
56In a binary classification problem with severe class imbalance, where the positive class is rare but of high importance, why is the Area Under the ROC Curve (AUC-ROC) potentially a misleading metric compared to the Area Under the Precision-Recall Curve (AUC-PR)?
Feature engineering and model evaluation (cross-validation, precision, recall)
Hard
A.AUC-ROC includes True Negatives in its calculation (via the False Positive Rate), and in an imbalanced setting, a model can achieve a high score by simply correctly identifying the overwhelmingly large number of true negatives.
B.AUC-PR is insensitive to the decision threshold, whereas AUC-ROC is highly sensitive, making AUC-PR more robust.
C.AUC-ROC is only applicable to linear models, while AUC-PR can be used for any classifier.
D.AUC-ROC assumes that the costs of false positives and false negatives are equal.
Correct Answer: AUC-ROC includes True Negatives in its calculation (via the False Positive Rate), and in an imbalanced setting, a model can achieve a high score by simply correctly identifying the overwhelmingly large number of true negatives.
Explanation:
The ROC curve plots True Positive Rate (TPR) vs. False Positive Rate (FPR). FPR is calculated as $FP / (FP + TN)$. In a highly imbalanced dataset, the number of True Negatives (TN) is massive. A model can generate a large number of False Positives (FP) without making the FPR significantly large, resulting in a deceptively optimistic AUC-ROC score. The Precision-Recall curve, however, plots Precision ($TP / (TP + FP)$) vs. Recall (TPR). It does not use TN in its calculation. It focuses directly on the model's performance on the minority (positive) class, making it a much more informative metric for tasks like fraud or disease detection.
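A single operating point already shows the gap between the two views. The confusion-matrix counts below are hypothetical (100 positives among 100,000 samples).

```python
def rates(tp, fp, tn, fn):
    """Return the ROC and PR coordinates for one confusion matrix."""
    return {
        "tpr": tp / (tp + fn),        # recall, shared by both curves
        "fpr": fp / (fp + tn),        # ROC x-axis, dominated by TN when negatives are common
        "precision": tp / (tp + fp),  # PR y-axis, ignores TN entirely
    }

# Hypothetical counts: 100 positives, 99,900 negatives.
m = rates(tp=80, fp=1000, tn=98_900, fn=20)
print(m)
```

The false positive rate stays near 1% and looks excellent on a ROC plot, while precision reveals that only about 7% of flagged cases are real positives.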
57In a Bayesian network, if a directed path from node X to node Z exists, such as $X \rightarrow Y \rightarrow Z$, what is the effect of conditioning on node Y?
Bayesian networks, and probabilistic reasoning
Hard
A.It makes X and Z conditionally dependent.
B.It has no effect on the relationship between X and Z.
C.It makes X and Z conditionally independent.
D.It reverses the direction of influence from Z to X.
Correct Answer: It makes X and Z conditionally independent.
Explanation:
This structure is a 'chain'. According to the rules of d-separation, a path is blocked if it contains a chain ($A \rightarrow B \rightarrow C$) where the middle node (B) is conditioned on. Intuitively, all information from X that influences Z must pass through Y. Once the state of Y is known, X provides no additional information about Z. Therefore, conditioning on Y renders X and Z conditionally independent.
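The chain rule can be confirmed by enumeration on a toy network. The sketch assumes hypothetical conditional probabilities: X is a fair coin, Y copies X with probability 0.9, and Z copies Y with probability 0.8.

```python
from fractions import Fraction
from itertools import product

def flip(same, p_same):
    """Probability of a child value given whether it matches its parent."""
    return p_same if same else 1 - p_same

# Hypothetical chain X -> Y -> Z over binary variables.
joint = {
    (x, y, z): Fraction(1, 2) * flip(y == x, Fraction(9, 10)) * flip(z == y, Fraction(8, 10))
    for x, y, z in product((0, 1), repeat=3)
}

def prob(event, given):
    """P(event | given), computed by enumerating the joint distribution."""
    num = sum(p for k, p in joint.items() if event(*k) and given(*k))
    den = sum(p for k, p in joint.items() if given(*k))
    return num / den

z_is_1 = lambda x, y, z: z == 1
p_z_x1 = prob(z_is_1, lambda x, y, z: x == 1)
p_z_x0 = prob(z_is_1, lambda x, y, z: x == 0)
p_z_y1_x1 = prob(z_is_1, lambda x, y, z: y == 1 and x == 1)
p_z_y1_x0 = prob(z_is_1, lambda x, y, z: y == 1 and x == 0)

print(p_z_x1, p_z_x0)        # differ: X and Z are marginally dependent
print(p_z_y1_x1, p_z_y1_x0)  # equal: given Y, X adds nothing about Z
```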
58You are building a linear regression model and suspect multicollinearity. You observe that the Variance Inflation Factors (VIFs) for several predictors are very high (>10). How does Ridge Regression (L2 regularization) specifically address the mathematical instability caused by this issue?
Statistics
Hard
A.By performing feature selection and setting the coefficients of correlated predictors to exactly zero.
B.By adding a positive value ($\lambda$) to the diagonal of the $X^T X$ matrix before inversion, making the matrix non-singular and stable even with correlated features.
C.By transforming the correlated features into a new set of orthogonal features using PCA before fitting the model.
D.By using a robust loss function that is less sensitive to the large coefficients that result from multicollinearity.
Correct Answer: By adding a positive value ($\lambda$) to the diagonal of the $X^T X$ matrix before inversion, making the matrix non-singular and stable even with correlated features.
Explanation:
The solution for Ridge Regression is $\hat{\beta} = (X^T X + \lambda I)^{-1} X^T y$. Multicollinearity causes the matrix $X^T X$ to be nearly singular (ill-conditioned), meaning some of its eigenvalues are close to zero. The inversion of such a matrix is numerically unstable. By adding $\lambda I$ (a positive constant times the identity matrix), we are effectively adding $\lambda$ to each eigenvalue of $X^T X$. This shifts all eigenvalues away from zero, guaranteeing that the matrix is invertible and well-conditioned, thus stabilizing the solution. This process shrinks the resulting coefficients, reducing their variance.
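The eigenvalue shift can be checked on a 2x2 example. The sketch assumes a hypothetical Gram matrix for two standardized features with correlation 0.999 and an arbitrary penalty of 0.1.

```python
import math

def sym2x2_eigenvalues(a, b, d):
    """Eigenvalues of the symmetric matrix [[a, b], [b, d]], smallest first."""
    mean = (a + d) / 2
    radius = math.sqrt(((a - d) / 2) ** 2 + b ** 2)
    return mean - radius, mean + radius

# Hypothetical Gram matrix of two standardized, almost collinear features.
a, b, d = 1.0, 0.999, 1.0
lam = 0.1  # hypothetical ridge penalty added to the diagonal

lo, hi = sym2x2_eigenvalues(a, b, d)
lo_r, hi_r = sym2x2_eigenvalues(a + lam, b, d + lam)

cond_before = hi / lo      # ratio of largest to smallest eigenvalue
cond_after = hi_r / lo_r
print(lo, lo_r)
print(cond_before, cond_after)
```

Each eigenvalue moves up by exactly the penalty, so the smallest one is pulled away from zero and the condition number drops by roughly two orders of magnitude in this example.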
59In the context of deep learning, what is the critical difference between semi-supervised learning and transfer learning?
Supervised, unsupervised, and reinforcement learning: concepts and real-world use
Hard
A.Semi-supervised learning requires at least two different models to be trained, whereas transfer learning only requires fine-tuning one model.
B.Transfer learning is a form of unsupervised learning, while semi-supervised learning is a form of supervised learning.
C.Semi-supervised learning uses a large amount of unlabeled data and a small amount of labeled data from the same task to improve performance, while transfer learning uses knowledge (e.g., weights) from a different, pre-trained task to bootstrap learning on a new task.
D.Transfer learning is used for reinforcement learning problems, while semi-supervised learning is used for classification problems.
Correct Answer: Semi-supervised learning uses a large amount of unlabeled data and a small amount of labeled data from the same task to improve performance, while transfer learning uses knowledge (e.g., weights) from a different, pre-trained task to bootstrap learning on a new task.
Explanation:
The key distinction is the source and purpose of the 'extra' data/knowledge. Semi-supervised learning leverages unlabeled data from the target domain to better understand its underlying structure, helping the model generalize from the few labeled examples it has for that same domain. Transfer learning leverages a model (and its learned features) trained on a different, often much larger, dataset (e.g., ImageNet) and applies it to a new target task (e.g., medical image classification), which may have limited labeled data. It's about transferring knowledge across tasks, not leveraging unlabeled data for the same task.
60You are building a time-series forecasting model to predict next month's sales based on the previous 12 months. You decide to use 5-fold cross-validation by randomly shuffling and splitting your 5 years of monthly data. Why is this a critically flawed evaluation strategy?
Feature engineering and model evaluation (cross-validation, precision, recall)
Hard
A.It reduces the amount of training data available in each fold, leading to a high-bias model.
B.It violates the i.i.d. (independent and identically distributed) assumption, which is a necessary condition for all machine learning models.
C.Random shuffling is computationally inefficient for time-series data compared to a simple chronological split.
D.It causes severe data leakage, as the model will be trained on data from the future to make 'predictions' about the past, leading to an unrealistically optimistic performance estimate.
Correct Answer: It causes severe data leakage, as the model will be trained on data from the future to make 'predictions' about the past, leading to an unrealistically optimistic performance estimate.
Explanation:
The most critical flaw is data leakage. Time-series data has an inherent temporal order. By randomly shuffling, a fold's training set could contain data from 2022, while its validation set contains data from 2021. The model learns from information that would not have been available at the time of the 'prediction', a situation that is impossible in a real-world deployment. This leakage leads to performance metrics that are artificially inflated and do not reflect the model's true ability to forecast the future. The correct approach is to use a method that respects temporal order, such as walk-forward validation or time-series split.
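A walk-forward scheme with an expanding training window can be sketched as follows. The generator name and the choice of 4 folds with 6 test months each are assumptions for illustration, not a prescribed configuration.

```python
def walk_forward_splits(n_samples, n_splits, test_size):
    """Yield (train_indices, test_indices) so that every test block follows its training data."""
    first_test_start = n_samples - n_splits * test_size
    if first_test_start <= 0:
        raise ValueError("not enough samples for the requested splits")
    for i in range(n_splits):
        test_start = first_test_start + i * test_size
        train = list(range(test_start))                          # everything before the test block
        test = list(range(test_start, test_start + test_size))   # the next test_size months
        yield train, test

# 60 monthly observations (5 years), 4 folds of 6 test months each.
splits = list(walk_forward_splits(n_samples=60, n_splits=4, test_size=6))
for train, test in splits:
    print(f"train 0..{train[-1]}  test {test[0]}..{test[-1]}")
```

Every training index is strictly earlier than every test index in the same fold, so the model never sees the future when it is evaluated.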