1. Which type of machine learning uses labeled data, where each data point is tagged with a correct output, to train a model?
Supervised, unsupervised, and reinforcement learning: concepts and real-world use
Easy
A.Reinforcement Learning
B.Semi-supervised Learning
C.Supervised Learning
D.Unsupervised Learning
Correct Answer: Supervised Learning
Explanation:
Supervised learning is characterized by its use of labeled datasets. The algorithm learns from input-output pairs to make predictions on new, unseen data.
2. What measure of central tendency represents the most frequently occurring value in a dataset?
Statistics
Easy
A.Range
B.Mode
C.Median
D.Mean
Correct Answer: Mode
Explanation:
The mode is the value that appears most often in a set of data. The mean is the average, and the median is the middle value when the data is sorted.
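For reference, all three measures can be computed with Python's standard library; the dataset below is made up for illustration, not taken from the question.

```python
# Central-tendency measures on a small illustrative dataset.
import statistics

data = [2, 3, 3, 5, 7, 3, 9]

most_common = statistics.mode(data)    # value appearing most often: 3
middle = statistics.median(data)       # middle of sorted [2,3,3,3,5,7,9]: 3
average = statistics.mean(data)        # arithmetic mean: 32/7, about 4.57
```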
3. If you roll a single fair six-sided die, what is the probability of rolling an even number (2, 4, or 6)?
Probability
Easy
A.1/2
B.2/3
C.1/3
D.1/6
Correct Answer: 1/2
Explanation:
A six-sided die has three even numbers (2, 4, 6) out of six possible outcomes. Therefore, the probability is 3/6, which simplifies to 1/2.
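The same counting argument, sketched in Python with exact fractions:

```python
# Enumerate the sample space of one fair die and count the even outcomes.
from fractions import Fraction

outcomes = [1, 2, 3, 4, 5, 6]
favorable = [x for x in outcomes if x % 2 == 0]   # [2, 4, 6]
p_even = Fraction(len(favorable), len(outcomes))  # 3/6, reduced to 1/2
```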
4. In machine learning, a collection of numbers representing a single data point (e.g., height, weight, and age of a person) is typically stored as a:
Linear algebra (applied focus)
Easy
A.Scalar
B.Matrix
C.Vector
D.Tensor
Correct Answer: Vector
Explanation:
A vector is a 1-dimensional array of numbers, which is a convenient way to represent the features of a single observation or data point.
5. What is the primary purpose of cross-validation in model evaluation?
Feature engineering and model evaluation (cross-validation, precision, recall)
Easy
A.To always increase the model's accuracy on the training data
B.To make the model train faster
C.To assess how a model will generalize to an independent, unseen dataset
D.To simplify the features of the data
Correct Answer: To assess how a model will generalize to an independent, unseen dataset
Explanation:
Cross-validation is a technique for evaluating a model's performance by training it on subsets of the data and testing it on the complementary subset, giving a more robust estimate of its performance on new data.
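A minimal sketch of how k-fold splitting works, using indices only. It assumes the sample count divides evenly by k and does no shuffling; library implementations such as scikit-learn's KFold also shuffle and handle uneven folds.

```python
# Split indices 0..n_samples-1 into k (train, validation) pairs.
def kfold_indices(n_samples, k):
    fold_size = n_samples // k
    splits = []
    for i in range(k):
        val = list(range(i * fold_size, (i + 1) * fold_size))
        val_set = set(val)
        train = [j for j in range(n_samples) if j not in val_set]
        splits.append((train, val))
    return splits

splits = kfold_indices(1000, 5)  # 5 folds: each has 200 val, 800 train indices
```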
6. Grouping customers into different segments based on their purchasing habits, without any predefined labels for the groups, is a classic example of:
Supervised, unsupervised, and reinforcement learning: concepts and real-world use
Easy
A.Reinforcement Learning
B.Unsupervised Learning
C.Regression
D.Supervised Learning
Correct Answer: Unsupervised Learning
Explanation:
This task is known as clustering, where the goal is to find hidden patterns or structures in unlabeled data. Since there are no predefined labels for the customer groups, it is an unsupervised learning problem.
7. Bayes' Theorem is used to update the probability of a hypothesis based on new evidence. What is the term for this updated probability?
Bayes' theorem
Easy
A.Marginal probability
B.Prior probability
C.Posterior probability
D.Likelihood
Correct Answer: Posterior probability
Explanation:
The posterior probability, often written as P(H | E), is the revised probability of a hypothesis H after observing evidence E. It's the result of applying Bayes' theorem.
8. A 2-dimensional grid of numbers, often used to represent an entire dataset where rows are data points and columns are features, is called a:
Linear algebra (applied focus)
Easy
A.Matrix
B.Vector
C.Scalar
D.Diagonal
Correct Answer: Matrix
Explanation:
A matrix is a rectangular array of numbers arranged in rows and columns. It's the standard way to represent a tabular dataset in machine learning.
9. In the context of a binary classification model, what does 'Precision' measure?
Feature engineering and model evaluation (cross-validation, precision, recall)
Easy
A.The proportion of actual positive instances that were correctly identified
B.The proportion of positive predictions that were actually correct
C.The model's ability to correctly identify negative instances
D.The overall accuracy of the model across all classes
Correct Answer: The proportion of positive predictions that were actually correct
Explanation:
Precision asks the question: 'Of all the instances the model predicted as positive, how many were actually positive?' It is calculated as TP / (TP + FP), where TP is True Positives and FP is False Positives.
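The formula side by side with its counterpart, recall, using illustrative counts that are not from the question:

```python
# Precision and recall from raw confusion-matrix counts.
def precision(tp, fp):
    # Of all positive predictions, how many were correct?
    return tp / (tp + fp)

def recall(tp, fn):
    # Of all actual positives, how many did the model find?
    return tp / (tp + fn)

# Say the model made 10 positive predictions and 8 were truly positive,
# while 4 actual positives were missed:
p = precision(tp=8, fp=2)   # 0.8
r = recall(tp=8, fn=4)      # 8 of 12 actual positives found, about 0.667
```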
10. A self-driving car AI learning to navigate by receiving a 'reward' for a correct action and a 'penalty' for a mistake is using which type of machine learning?
Supervised, unsupervised, and reinforcement learning: concepts and real-world use
Easy
A.Supervised Learning
B.Dimensionality Reduction
C.Reinforcement Learning
D.Unsupervised Learning
Correct Answer: Reinforcement Learning
Explanation:
Reinforcement learning involves an 'agent' learning to make decisions by performing actions in an environment to maximize a cumulative reward. This trial-and-error approach with rewards and penalties is the core of RL.
11. What is the 'median' of the following set of numbers: [1, 7, 3, 9, 5]?
Statistics
Easy
A.7
B.3
C.5
D.25
Correct Answer: 5
Explanation:
To find the median, you must first sort the numbers: [1, 3, 5, 7, 9]. The median is the middle value in the sorted list, which is 5.
12. What does a node in a Bayesian Network typically represent?
Bayesian networks and probabilistic reasoning
Easy
A.A deterministic value
B.An entire dataset
C.A machine learning algorithm
D.A random variable
Correct Answer: A random variable
Explanation:
A Bayesian Network is a probabilistic graphical model. Each node (or vertex) in the graph represents a random variable, and the edges represent conditional dependencies between them.
13. The probability of any event is always a number between:
Probability
Easy
A.0 and 100 (inclusive)
B.0 and 1 (inclusive)
C.1 and infinity
D.-1 and 1 (inclusive)
Correct Answer: 0 and 1 (inclusive)
Explanation:
Probability is a measure of the likelihood of an event occurring. A probability of 0 means the event is impossible, and a probability of 1 means the event is certain. It cannot be negative or greater than 1.
14. The process of selecting, transforming, or creating the most suitable input variables for a machine learning model is called:
Feature engineering and model evaluation (cross-validation, precision, recall)
Easy
A.Cross-validation
B.Algorithm selection
C.Feature engineering
D.Model evaluation
Correct Answer: Feature engineering
Explanation:
Feature engineering is a crucial pre-modeling step that uses domain knowledge to create features that make machine learning algorithms work better. It directly impacts model performance.
15. Predicting the exact price of a house based on its size, location, and number of bedrooms is an example of what kind of problem?
Supervised, unsupervised, and reinforcement learning: concepts and real-world use
Easy
A.Regression
B.Association
C.Clustering
D.Classification
Correct Answer: Regression
Explanation:
Regression is a type of supervised learning task where the goal is to predict a continuous numerical value (like a price), rather than a discrete category.
16. What is a scalar in the context of linear algebra?
Linear algebra (applied focus)
Easy
A.A type of model
B.An array of numbers
C.A single number
D.A grid of numbers
Correct Answer: A single number
Explanation:
A scalar is simply an ordinary number (an integer or a real number). It is used to scale vectors and matrices through multiplication.
17. Email spam detection, where an algorithm is trained on emails already labeled as 'spam' or 'not spam', is a task known as:
Supervised, unsupervised, and reinforcement learning: concepts and real-world use
Easy
A.Classification
B.Clustering
C.Regression
D.Reinforcement Learning
Correct Answer: Classification
Explanation:
Classification is a supervised learning task where the model learns to assign a category or class label (e.g., 'spam' or 'not spam') to a new observation.
18. In a dataset of student test scores, the difference between the highest and lowest score is known as the:
Statistics
Easy
A.Mean
B.Standard Deviation
C.Variance
D.Range
Correct Answer: Range
Explanation:
The range is the simplest measure of statistical dispersion or variability. It is calculated by subtracting the minimum value from the maximum value in a dataset.
19. In a Bayesian Network, what does a directed edge (an arrow) from Node A to Node B signify?
Bayesian networks and probabilistic reasoning
Easy
A.Node A and Node B have the same probability distribution
B.The state of Node A directly influences the probability of the state of Node B
C.The state of Node B directly influences the probability of the state of Node A
D.Node A and Node B are completely independent
Correct Answer: The state of Node A directly influences the probability of the state of Node B
Explanation:
The directed edges in a Bayesian Network represent conditional dependencies. An arrow from A to B means that the probability of B is dependent on the value of A. A is considered a 'parent' of B.
20. A 'False Positive' in a medical test designed to detect a disease means:
Feature engineering and model evaluation (cross-validation, precision, recall)
Easy
A.The test incorrectly indicates a sick person is healthy
B.The test correctly indicates a sick person has the disease
C.The test incorrectly indicates a healthy person has the disease
D.The test correctly indicates a healthy person is healthy
Correct Answer: The test incorrectly indicates a healthy person has the disease
Explanation:
A False Positive (also called a Type I error) is an outcome where the model incorrectly predicts the positive class. In this case, 'having the disease' is the positive class, so the test falsely identifies a healthy person as being sick.
21. A model designed to detect a rare but critical disease has a high precision of 95% but a very low recall of 10%. What is the most accurate interpretation of this result?
Feature engineering and model evaluation (cross-validation, precision, recall)
Medium
A.The model is highly reliable when it predicts a patient has the disease, but it misses most of the actual positive cases.
B.The model correctly identifies most of the patients who have the disease.
C.The model has a high overall accuracy and is performing well.
D.The model incorrectly flags many healthy patients as having the disease.
Correct Answer: The model is highly reliable when it predicts a patient has the disease, but it misses most of the actual positive cases.
Explanation:
High precision (True Positives / (True Positives + False Positives)) means that when the model predicts a positive case, it is very likely to be correct. However, low recall (True Positives / (True Positives + False Negatives)) means the model fails to identify a large number of actual positive cases (it has many false negatives). This is a critical issue in medical diagnosis where failing to detect a disease is often more dangerous than a false alarm.
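One hypothetical set of confusion-matrix counts that reproduces the 95% precision / 10% recall scenario (the counts are invented for illustration):

```python
# Hypothetical counts: 20 positive predictions, 190 actual positive cases.
tp, fp, fn = 19, 1, 171

prec = tp / (tp + fp)   # 19/20  = 0.95: positive predictions are trustworthy
rec = tp / (tp + fn)    # 19/190 = 0.10: 90% of sick patients are missed
missed_cases = fn       # 171 actual positives the model never flags
```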
22. An e-commerce company wants to group its customers into distinct segments based on purchasing behavior (e.g., frequency, items bought, total spending) without any predefined labels for these segments. Which type of machine learning is most suitable for this task?
Supervised, unsupervised, and reinforcement learning: concepts and real-world use
Medium
A.Semi-Supervised Learning
B.Supervised Learning
C.Reinforcement Learning
D.Unsupervised Learning
Correct Answer: Unsupervised Learning
Explanation:
This is a classic clustering problem. Since there are no predefined labels for the customer segments, the goal is to discover inherent structures or patterns in the data. Unsupervised learning algorithms, such as K-Means or DBSCAN, are designed for this purpose.
23. In Principal Component Analysis (PCA), the eigenvectors of the data's covariance matrix represent the...
Linear algebra (applied focus)
Medium
A.variance of each principal component.
B.average value of each feature.
C.number of clusters in the data.
D.directions of maximum variance in the data.
Correct Answer: directions of maximum variance in the data.
Explanation:
PCA works by finding a new set of orthogonal axes, called principal components, that align with the directions of maximum variance in the dataset. These directions are defined by the eigenvectors of the covariance matrix. The first principal component corresponds to the eigenvector with the largest eigenvalue, capturing the most variance.
24. A medical test for a disease has a 99% accuracy rate (it's correct 99% of the time). The disease has a prevalence of 1 in 10,000 people. If a randomly selected person tests positive, what can you conclude about the probability that they actually have the disease?
Bayes' theorem, Bayesian networks, and probabilistic reasoning
Medium
A.The probability is actually quite low (much less than 50%).
B.It is impossible to determine without knowing the false negative rate.
C.The probability is very high, but slightly less than 99%.
D.The probability is 99%.
Correct Answer: The probability is actually quite low (much less than 50%).
Explanation:
This is a classic example of the base rate fallacy, solved using Bayes' theorem. Because the disease is so rare, false positives from the large healthy population far outnumber true positives from the small sick population: out of 10,000 people, roughly 9,999 are healthy, and a 1% error rate among them produces about 100 false positives, against only about 1 true positive. Therefore, a positive test result is far more likely to be a false positive than a true positive.
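Working the numbers through Bayes' theorem, assuming the 99% figure applies to both sick and healthy patients (i.e., sensitivity and specificity are both 0.99; the question does not state this split explicitly):

```python
# Posterior probability of disease given a positive test.
p_disease = 1 / 10_000
sensitivity = 0.99          # P(positive | disease)
false_positive_rate = 0.01  # P(positive | healthy)

# Total probability of testing positive:
p_positive = sensitivity * p_disease + false_positive_rate * (1 - p_disease)

# Bayes' theorem:
p_disease_given_positive = sensitivity * p_disease / p_positive
# ~0.0098: fewer than 1 in 100 positive results are true positives
```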
25. A dataset of employee salaries at a tech company is heavily right-skewed due to a few extremely high executive salaries. Which measure of central tendency would provide the most realistic representation of a 'typical' employee's salary?
Statistics
Medium
A.Mean
B.Mode
C.Standard Deviation
D.Median
Correct Answer: Median
Explanation:
The mean is highly sensitive to outliers and extreme values. In a right-skewed distribution, the mean is pulled upwards by the high values, making it an inflated representation of the central tendency. The median, which is the middle value when data is sorted, is robust to outliers and will give a much better indication of the typical salary.
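A small made-up salary list showing how the outliers pull the mean but not the median:

```python
# Right-skewed salaries (in thousands): two executive outliers at the top.
import statistics

salaries = [60, 65, 70, 72, 75, 80, 85, 900, 1200]

mean_salary = statistics.mean(salaries)      # ~289.7: inflated by outliers
median_salary = statistics.median(salaries)  # 75: a realistic "typical" salary
```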
26. A program learns to play chess by making moves and receiving a reward of +1 for a win, -1 for a loss, and 0 for a draw after each game. The program's goal is to maximize its cumulative reward over many games. This scenario is a prime example of:
Supervised, unsupervised, and reinforcement learning: concepts and real-world use
Medium
A.Clustering (Unsupervised Learning)
B.Reinforcement Learning
C.Classification (Supervised Learning)
D.Regression (Supervised Learning)
Correct Answer: Reinforcement Learning
Explanation:
This problem has all the key elements of reinforcement learning: an agent (the program) interacting with an environment (the chess game), taking actions (making moves), and receiving delayed rewards (the outcome of the game). The agent learns a policy (a strategy for choosing moves) to maximize its long-term reward.
27. To evaluate a model's performance and generalizability while tuning its hyperparameters, the standard practice is to split the data into three sets. What is the primary purpose of the 'validation' set?
Feature engineering and model evaluation (cross-validation, precision, recall)
Medium
A.To provide a final, unbiased evaluation of the model's performance on unseen data.
B.To increase the amount of data available for training.
C.To train the final model after hyperparameters have been chosen.
D.To select the best model hyperparameters without 'leaking' information from the test set.
Correct Answer: To select the best model hyperparameters without 'leaking' information from the test set.
Explanation:
The training set is used to fit the model. The validation set is used to evaluate different hyperparameter settings and choose the best-performing model. The test set is kept completely separate until the very end to provide a final, unbiased estimate of how the chosen model will perform on new, unseen data. Using the test set for hyperparameter tuning would lead to an overly optimistic performance estimate.
28. In natural language processing, words are often represented as high-dimensional vectors (word embeddings). The cosine similarity between two word vectors measures:
Linear algebra (applied focus)
Medium
A.The number of characters the words have in common.
B.The difference in word frequency.
C.The semantic similarity or relatedness of the words.
D.The Euclidean distance between the words in the vector space.
Correct Answer: The semantic similarity or relatedness of the words.
Explanation:
Cosine similarity measures the cosine of the angle between two vectors. In the context of word embeddings, vectors for semantically similar words (e.g., 'king' and 'queen') are designed to point in similar directions. A cosine similarity close to 1 indicates a small angle and thus high semantic similarity, while a value close to 0 indicates orthogonality or dissimilarity.
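A from-scratch cosine similarity on tiny made-up 3-dimensional "embeddings" (real word vectors have hundreds of dimensions, and the values below are invented for illustration):

```python
# Cosine of the angle between two vectors: dot product over norms.
import math

def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

king = [0.9, 0.8, 0.1]
queen = [0.85, 0.82, 0.15]
apple = [0.1, 0.2, 0.9]

similar = cosine_similarity(king, queen)     # close to 1: related words
dissimilar = cosine_similarity(king, apple)  # much smaller: unrelated words
```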
29. In a classification problem, the output of a logistic regression model for a given input is 0.7. What does this value represent?
Probability
Medium
A.The probability that the input belongs to the positive class.
B.The predicted class label is 0.7.
C.The margin of separation from the decision boundary.
D.The accuracy of the model on this specific input.
Correct Answer: The probability that the input belongs to the positive class.
Explanation:
Logistic regression models the probability that an input belongs to a particular class. The sigmoid function at its output squashes any real-valued number into the range [0, 1], which is interpreted as a probability. A value of 0.7 means the model estimates a 70% probability that the sample belongs to the positive class (class 1).
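A sketch of how such an output arises: the sigmoid squashes the model's raw score (logit) into (0, 1), and a threshold turns that probability into a class label. The logit value below is chosen only so the output lands near 0.7.

```python
# Sigmoid: maps any real number into the open interval (0, 1).
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

probability = sigmoid(0.847)                      # roughly 0.70
predicted_class = 1 if probability >= 0.5 else 0  # common 0.5 threshold
```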
30. In a simple Bayesian Network representing a medical diagnosis, we have the structure: Disease -> Symptom. If we know a patient has the Disease, what does this tell us about the probability of them having the Symptom?
Bayes' theorem, Bayesian networks, and probabilistic reasoning
Medium
A.The probability of the Symptom is now conditioned on the presence of the Disease, and is given by P(Symptom | Disease).
B.Knowing the patient has the Disease makes the Symptom certain to occur.
C.The probability of the Symptom becomes 0.
D.Knowing the patient has the Disease does not change the probability of the Symptom.
Correct Answer: The probability of the Symptom is now conditioned on the presence of the Disease, and is given by P(Symptom | Disease).
Explanation:
The arrow from Disease to Symptom indicates a direct causal or influential relationship, where the state of Disease affects the probability of Symptom. The network encodes the conditional probability P(Symptom | Disease). Observing that the patient has the disease means we use this conditional probability table to update our belief about the likelihood of the symptom.
31. What is the primary motivation for standardizing features (e.g., using Z-score normalization) before applying distance-based algorithms like K-Nearest Neighbors (KNN)?
Statistics
Medium
A.To reduce the number of features in the dataset.
B.To convert all features into a [0, 1] range.
C.To prevent features with larger scales from dominating the distance calculations.
D.To make the data conform to a normal distribution.
Correct Answer: To prevent features with larger scales from dominating the distance calculations.
Explanation:
KNN calculates distances (like Euclidean distance) between data points. If one feature has a much larger scale than others (e.g., salary in dollars vs. years of experience), its contribution to the distance calculation will overwhelm the others. Standardization rescales features to have a mean of 0 and a standard deviation of 1, ensuring that all features contribute more equally to the distance metric.
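A minimal z-score standardization in plain Python; the two feature lists are invented for illustration:

```python
# Rescale each feature to mean 0 and standard deviation 1 so that no
# single feature dominates Euclidean distance in KNN.
import statistics

def standardize(values):
    mu = statistics.mean(values)
    sigma = statistics.pstdev(values)  # population standard deviation
    return [(x - mu) / sigma for x in values]

salaries_usd = [40_000, 55_000, 70_000, 120_000]  # scale: tens of thousands
years_exp = [1, 3, 5, 12]                         # scale: single digits

z_salaries = standardize(salaries_usd)  # now on the same scale
z_years = standardize(years_exp)        # as the salary feature
```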
32. If you multiply a 2D vector by the matrix [[0, -1], [1, 0]], what geometric transformation is applied to the vector?
Linear algebra (applied focus)
Medium
A.A scaling by a factor of 2.
B.A 90-degree counter-clockwise rotation.
C.A reflection across the y-axis.
D.A projection onto the x-axis.
Correct Answer: A 90-degree counter-clockwise rotation.
Explanation:
Let's apply the transformation to a basis vector, e.g., (1, 0). The result is (0, 1). The vector on the x-axis is rotated to the y-axis. Applying it to (0, 1) gives (-1, 0). This consistent behavior demonstrates a 90-degree counter-clockwise rotation around the origin.
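Applying the rotation matrix [[0, -1], [1, 0]] to both basis vectors confirms the answer:

```python
# 2x2 matrix-vector multiplication by hand.
def apply_matrix(m, v):
    return [m[0][0] * v[0] + m[0][1] * v[1],
            m[1][0] * v[0] + m[1][1] * v[1]]

rot90 = [[0, -1],
         [1,  0]]

e1_rotated = apply_matrix(rot90, [1, 0])  # [0, 1]: x-axis -> y-axis
e2_rotated = apply_matrix(rot90, [0, 1])  # [-1, 0]: y-axis -> negative x-axis
```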
33. You perform 5-fold cross-validation to evaluate your machine learning model. If your dataset has 1000 instances, how many instances are in the training set for each of the 5 iterations?
Feature engineering and model evaluation (cross-validation, precision, recall)
Medium
A.200
B.1000
C.800
D.500
Correct Answer: 800
Explanation:
In k-fold cross-validation, the dataset is divided into k equal (or nearly equal) folds. For each iteration, one fold is used as the validation set, and the remaining k-1 folds are used as the training set. With 1000 instances and k=5, each fold has 1000/5 = 200 instances. Therefore, the training set in each iteration will consist of 4 folds, which is 4 * 200 = 800 instances.
34. Which of the following problems is best framed as a regression task rather than a classification task?
Supervised, unsupervised, and reinforcement learning: concepts and real-world use
Medium
A.Identifying the species of a flower from a photo.
B.Determining if an email is spam or not spam.
C.Predicting whether a customer will churn (yes/no).
D.Predicting the price of a house based on its features.
Correct Answer: Predicting the price of a house based on its features.
Explanation:
Classification tasks involve predicting a discrete, categorical label (e.g., yes/no, spam/not spam, flower species). Regression tasks involve predicting a continuous numerical value. Predicting a house price, which can be any value within a range, is a quintessential regression problem.
35. Two events A and B are mutually exclusive. If P(A) = 0.4 and P(B) = 0.3, what is the probability of A or B occurring, i.e., P(A ∪ B)?
Probability
Medium
A.0.12
B.0.1
C.0.7
D.1.0
Correct Answer: 0.7
Explanation:
The general formula for the union of two events is P(A ∪ B) = P(A) + P(B) - P(A ∩ B). For mutually exclusive events, they cannot occur at the same time, so their joint probability P(A ∩ B) is 0. Therefore, the formula simplifies to P(A ∪ B) = P(A) + P(B) = 0.4 + 0.3 = 0.7.
36. You are building a spam filter. Given that P(Spam) = 0.2, P(contains 'free' | Spam) = 0.8, and P(contains 'free' | Not Spam) = 0.05. Using Bayes' theorem, what are you trying to calculate?
Bayes' theorem, Bayesian networks, and probabilistic reasoning
Medium
A.P(contains 'free')
B.P(Spam | contains 'free')
C.P(Spam, contains 'free')
D.P(Not Spam)
Correct Answer: P(Spam | contains 'free')
Explanation:
The goal of a spam filter is to determine the probability that an email is spam given that we have observed some evidence (like the word 'free'). This is a conditional probability, specifically the posterior probability P(Spam | contains 'free'). Bayes' theorem provides the framework to calculate this using the prior probability P(Spam), the likelihood P(contains 'free' | Spam), and the evidence P(contains 'free').
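The full calculation, plugging in the numbers given in the question:

```python
# Bayes' theorem for the spam filter example.
p_spam = 0.2
p_free_given_spam = 0.8
p_free_given_not_spam = 0.05

# Evidence: total probability of seeing the word 'free'
p_free = p_free_given_spam * p_spam + p_free_given_not_spam * (1 - p_spam)

# Posterior: P(Spam | contains 'free') = 0.16 / 0.20 = 0.8
p_spam_given_free = p_free_given_spam * p_spam / p_free
```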
37. What is the primary risk of performing feature selection based on the model's performance on the final test set?
Feature engineering and model evaluation (cross-validation, precision, recall)
Medium
A.It is computationally too expensive.
B.It can lead to a model that is too simple (underfitting).
C.It requires the data to be normally distributed.
D.It causes information from the test set to leak into the model selection process, leading to an over-optimistic performance estimate.
Correct Answer: It causes information from the test set to leak into the model selection process, leading to an over-optimistic performance estimate.
Explanation:
The test set must be held out and used only once for a final, unbiased evaluation. If you use the test set to guide your feature selection (or any other model tuning), you are implicitly fitting your model to the test set. The resulting performance metric will be inflated because the model has already 'seen' the data in some form, and it will not generalize as well to truly new, unseen data.
38. What does it imply if the determinant of a transformation matrix used in a machine learning model is zero?
Linear algebra (applied focus)
Medium
A.The transformation is a pure rotation.
B.The transformation is an identity operation (no change).
C.The transformation scales the data uniformly.
D.The transformation collapses the data into a lower-dimensional space.
Correct Answer: The transformation collapses the data into a lower-dimensional space.
Explanation:
The determinant of a matrix represents the factor by which area (in 2D) or volume (in 3D) is scaled by the transformation. A determinant of zero means that the area/volume becomes zero. This happens when the transformation squashes the data onto a line (from 2D) or a plane (from 3D), effectively reducing its dimensionality and making the transformation non-invertible.
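A concrete singular 2x2 example: both columns lie on the same line, so the unit square is squashed to zero area.

```python
# Determinant of a 2x2 matrix: ad - bc.
def det2(m):
    return m[0][0] * m[1][1] - m[0][1] * m[1][0]

singular = [[1, 2],
            [2, 4]]   # second column is 2x the first

d = det2(singular)    # 1*4 - 2*2 = 0: the transformation is non-invertible
col1 = [row[0] for row in singular]  # image of e1: [1, 2]
col2 = [row[1] for row in singular]  # image of e2: [2, 4] = 2 * col1
```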
39. In a linear regression analysis, you create a plot of residuals versus fitted values and observe a distinct funnel shape (heteroscedasticity). What key assumption of linear regression does this violate?
Statistics
Medium
A.Independence of errors.
B.Normality of errors.
C.Constant variance of errors (homoscedasticity).
D.Linearity of the relationship.
Correct Answer: Constant variance of errors (homoscedasticity).
Explanation:
One of the core assumptions of linear regression is homoscedasticity, which means the variance of the residuals (errors) should be constant across all levels of the independent variables. A funnel shape in the residual plot indicates that the error variance is not constant (it increases or decreases with the fitted values), which is known as heteroscedasticity. This can affect the reliability of the model's coefficient estimates and significance tests.
40. A hospital has a large dataset of patient records, where a small fraction of the records are labeled with a correct diagnosis by expert doctors, but the vast majority are unlabeled. The goal is to build a diagnostic model using all the available data. This problem is an example of:
Supervised, unsupervised, and reinforcement learning: concepts and real-world use
Medium
A.Transfer Learning
B.Active Learning
C.Multi-task Learning
D.Semi-Supervised Learning
Correct Answer: Semi-Supervised Learning
Explanation:
Semi-supervised learning is a paradigm that falls between supervised and unsupervised learning. It is used for problems where you have a small amount of labeled data and a large amount of unlabeled data. The goal is to leverage the structure within the unlabeled data to improve the performance of a model that is initially trained on the small labeled set.
41. In the context of Principal Component Analysis (PCA), if the covariance matrix of your centered data is a non-identity diagonal matrix (i.e., diagonal entries are positive but not all equal to 1), what can you definitively conclude about the principal components?
Linear algebra (applied focus)
Hard
A.The principal components are aligned with the original feature axes, and the transformation is essentially a scaling, not a rotation.
B.The covariance matrix is singular, and PCA cannot be computed.
C.The principal components will be at a 45-degree angle to the original axes, reflecting an average of the variances.
D.The data is perfectly correlated, and PCA will reduce its dimensionality to one.
Correct Answer: The principal components are aligned with the original feature axes, and the transformation is essentially a scaling, not a rotation.
Explanation:
A diagonal covariance matrix indicates that the original features are already uncorrelated. The goal of PCA is to find a new, orthogonal basis of uncorrelated variables (the principal components). Since the original feature axes already form an orthogonal basis of uncorrelated variables, the principal components will be aligned with these axes. The transformation matrix will be a permutation matrix or the identity matrix, meaning there is no rotation of the data, only scaling based on the variances (the diagonal entries).
42. You are developing a credit fraud detection model with a dataset where only 0.1% of transactions are fraudulent. After training, you achieve 99.9% accuracy. You then evaluate using the Area Under the Precision-Recall Curve (AUC-PR) and get a score of 0.2. What is the most accurate and nuanced interpretation of these results?
Feature engineering and model evaluation (cross-validation, precision, recall)
Hard
A.The high accuracy is a misleading metric due to extreme class imbalance, and the AUC-PR of 0.2, while appearing low, is significantly better than a random baseline and indicates the model has some, albeit imperfect, skill.
B.The model is excellent because the accuracy is nearly perfect, and the low AUC-PR score must be an error in calculation or interpretation.
C.An AUC-PR of 0.2 is very poor for any dataset, indicating the model's predictions are no better than random guessing.
D.The model is severely overfitting, as indicated by the large discrepancy between the accuracy score and the AUC-PR score.
Correct Answer: The high accuracy is a misleading metric due to extreme class imbalance, and the AUC-PR of 0.2, while appearing low, is significantly better than a random baseline and indicates the model has some, albeit imperfect, skill.
Explanation:
In a highly imbalanced dataset, a naive model predicting the majority class (non-fraudulent) would achieve 99.9% accuracy, making this metric useless. The AUC-PR is a more informative metric here. The baseline for AUC-PR (the score of a random classifier) is equal to the fraction of positives, which is 0.001. A score of 0.2 is 200 times better than random. Therefore, it correctly identifies that accuracy is misleading and that the model has learned a meaningful signal, even if its performance is not perfect.
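The arithmetic behind this interpretation:

```python
# Why 99.9% accuracy is hollow here, and why an AUC-PR of 0.2 is not.
fraud_rate = 0.001                    # 0.1% of transactions are positive

# A useless model that predicts 'not fraud' for every transaction:
naive_accuracy = 1 - fraud_rate       # 0.999: matches the headline accuracy

# A random classifier's expected AUC-PR equals the positive fraction:
random_baseline_auc_pr = fraud_rate   # 0.001
model_auc_pr = 0.2
lift = model_auc_pr / random_baseline_auc_pr  # 200x better than random
```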
43. In the Bayesian network defined by the structure A → B ← C, which statement about the probabilistic relationship between nodes A and C is correct?
Bayesian networks and probabilistic reasoning
Hard
A.A and C are marginally dependent but become conditionally independent given B.
B.A and C are marginally independent but become conditionally dependent given B.
C.A and C are both marginally and conditionally independent of each other.
D.A and C are both marginally and conditionally dependent on each other.
Correct Answer: A and C are marginally independent but become conditionally dependent given B.
Explanation:
This structure is known as a 'v-structure' or a 'collider'. Node B is a collider because two arrows point into it. In a v-structure, the path between A and C is blocked by default, making them marginally independent (A ⊥ C). However, if we observe or condition on the collider B (or any of its descendants), the path becomes unblocked, and information can flow between A and C. This makes them conditionally dependent. This phenomenon is often called 'explaining away'.
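The 'explaining away' effect can be verified by exact enumeration on a tiny collider. The mechanism B = A OR C below is an arbitrary choice for illustration, not part of the question:

```python
# Enumerate all worlds for A -> B <- C with fair binary A and C, B = A OR C.
from itertools import product
from fractions import Fraction

worlds = [(a, a | c, c) for a, c in product([0, 1], repeat=2)]  # each prob 1/4

def p_a1(condition):
    # Exact P(A = 1 | condition) by counting equally likely worlds.
    matching = [w for w in worlds if condition(w)]
    hits = [w for w in matching if w[0] == 1]
    return Fraction(len(hits), len(matching))

# Marginally, observing C tells us nothing about A:
marginal = p_a1(lambda w: True)              # 1/2
given_c1 = p_a1(lambda w: w[2] == 1)         # 1/2: still independent

# Conditioning on the collider B couples A and C:
given_b1 = p_a1(lambda w: w[1] == 1)                   # 2/3
given_b1_c1 = p_a1(lambda w: w[1] == 1 and w[2] == 1)  # 1/2, not 2/3
```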
44. A robotics company wants to train a bipedal robot to walk on varied and unseen terrain. The robot receives sensor data about its joint angles and orientation and a sparse positive reward only when it reaches a destination. It receives a large negative reward if it falls. Which specific class of algorithms within a broader ML paradigm is most appropriate for this task?
Supervised, unsupervised, and reinforcement learning: concepts and real-world use
Hard
A.Model-based Reinforcement Learning to create a perfect simulation of the robot's dynamics.
B.Supervised Learning on a labeled dataset of expert walking demonstrations.
C.Model-free, policy-based Reinforcement Learning, such as Proximal Policy Optimization (PPO).
D.Unsupervised Learning to discover structure in the robot's sensor readings.
Correct Answer: Model-free, policy-based Reinforcement Learning, such as Proximal Policy Optimization (PPO).
Explanation:
This is a classic Reinforcement Learning problem due to the agent-environment interaction and reward-based learning. Supervised learning is not feasible, as there is no labeled dataset of 'correct' movements. Unsupervised learning can help with state representation but does not solve the core control problem. Within RL, the dynamics of a bipedal robot on varied terrain are extremely complex and difficult to model accurately, making model-based RL challenging. The problem involves a continuous action space (joint torques/angles) and requires learning a stochastic policy, which makes model-free, policy-based methods like Proximal Policy Optimization (PPO) or Asynchronous Advantage Actor-Critic (A3C) the state-of-the-art and most suitable approach.
Incorrect! Try again.
45When solving a linear regression problem using the normal equation, $\theta = (X^TX)^{-1}X^Ty$, you find that the matrix $X^TX$ is singular (non-invertible). What is the most likely data-related cause and the mathematical consequence?
Linear algebra (applied focus)
Hard
A.Cause: The target variable y contains extreme outliers. Consequence: The matrix inverse cannot be computed.
B.Cause: Perfect multicollinearity among features. Consequence: The system has infinite solutions for the regression coefficients $\theta$.
C.Cause: Features are not normalized. Consequence: The inverse operation is numerically unstable, but a unique solution technically exists.
D.Cause: The number of samples is far greater than the number of features. Consequence: The model will underfit the data.
Correct Answer: Cause: Perfect multicollinearity among features. Consequence: The system has infinite solutions for the regression coefficients $\theta$.
Explanation:
The matrix $X^TX$, known as the Gram matrix, is singular if and only if the columns of X (the features) are linearly dependent. This condition is called perfect multicollinearity. From a linear algebra perspective, if $X^TX$ is singular, it does not have an inverse, meaning the normal equation does not yield a unique solution. Instead, it defines a system of linear equations with either no solution or infinite solutions. In the context of least squares, this corresponds to an infinite number of coefficient vectors $\theta$ that all achieve the same minimum squared error.
Incorrect! Try again.
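A small numpy sketch (with hypothetical data) shows the consequence directly: duplicating a column makes $X^TX$ rank-deficient, and the pseudoinverse then returns just one member — the minimum-norm one — of the infinite solution set.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(50, 1))
X = np.hstack([x, x])                       # perfect multicollinearity: identical columns
y = 3 * x[:, 0] + rng.normal(scale=0.1, size=50)

gram = X.T @ X
print(np.linalg.matrix_rank(gram))          # 1, not 2 -> singular, no unique inverse

# Any theta with theta_1 + theta_2 = 3 gives the same fit; pinv picks the
# minimum-norm solution, which splits the weight equally.
theta = np.linalg.pinv(X) @ y
print(theta.round(2))                       # roughly [1.5, 1.5]
```

The fitted values are identical for every solution on that line; only regularization (or dropping a column) restores uniqueness.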
46In a K-fold cross-validation setup, what is the primary statistical trade-off when increasing the value of K from 5 to N (where N is the total number of samples), a technique also known as Leave-One-Out Cross-Validation (LOOCV)?
Feature engineering and model evaluation (cross-validation, precision, recall)
Hard
A.The computational cost decreases, but the bias of the estimate increases.
B.The bias of the performance estimate decreases, but its variance increases significantly.
C.Both the bias and variance of the performance estimate decrease, improving reliability.
D.The variance of the performance estimate decreases, but its bias increases.
Correct Answer: The bias of the performance estimate decreases, but its variance increases significantly.
Explanation:
As K increases, the training set size for each fold ($\frac{K-1}{K}N$, i.e., $N-1$ for LOOCV) approaches the full dataset size N. Models trained on more data are generally better, so the performance estimate becomes less biased (it's a better estimate of the true performance of a model trained on N samples). However, with LOOCV, the N training sets are nearly identical (differing by only one sample). This high correlation between the models trained in each fold leads to a high variance in the final performance estimate. The average of highly correlated variables has a much higher variance than the average of independent variables. Thus, the trade-off is accepting higher variance for lower bias.
Incorrect! Try again.
47A rare disease affects 1 in 10,000 people. A test for the disease is developed with a 99% true positive rate and a 98% true negative rate. If a person tests positive, what is the probability they actually have the disease? The key challenge here is the combination of a rare event and an imperfect test.
Bayes theorem
Hard
A.Approximately 2%
B.Approximately 99%
C.Approximately 0.49%
D.Approximately 50%
Correct Answer: Approximately 0.49%
Explanation:
This requires a careful application of Bayes' theorem. Let D be having the disease, and T be testing positive. We want $P(D \mid T)$.
We are given:
$P(D) = 0.0001$ (prevalence: 1 in 10,000)
$P(T \mid D) = 0.99$ (True Positive Rate)
$P(\neg T \mid \neg D) = 0.98$ (True Negative Rate), so $P(T \mid \neg D) = 0.02$ (False Positive Rate)
Using Bayes' theorem:
$P(D \mid T) = \frac{P(T \mid D)P(D)}{P(T \mid D)P(D) + P(T \mid \neg D)P(\neg D)} = \frac{0.99 \times 0.0001}{0.99 \times 0.0001 + 0.02 \times 0.9999} \approx 0.0049$
This is approximately 0.49%. Even with a seemingly accurate test, the vast number of false positives from the healthy population swamps the true positives from the sick population due to the disease's rarity.
Incorrect! Try again.
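The arithmetic can be checked with a few lines of Python using the rates stated in the question:

```python
p_disease = 1 / 10_000            # prior P(D): 1 in 10,000
p_pos_given_disease = 0.99        # true positive rate P(T|D)
p_pos_given_healthy = 1 - 0.98    # false positive rate P(T|~D) = 0.02

# Bayes' theorem: P(D|T) = P(T|D)P(D) / P(T)
p_pos = (p_pos_given_disease * p_disease
         + p_pos_given_healthy * (1 - p_disease))
p_disease_given_pos = p_pos_given_disease * p_disease / p_pos

print(f"{p_disease_given_pos:.4%}")   # about 0.49%
```

Out of every million people tested, roughly 99 true positives are buried under about 20,000 false positives — which is why the posterior stays below half a percent.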
48What is the primary motivation for using an off-policy reinforcement learning algorithm like Q-Learning over an on-policy one like SARSA?
Supervised, unsupervised, and reinforcement learning: concepts and real-world use
Hard
A.It avoids the need for a discount factor gamma, simplifying the Bellman equation.
B.It guarantees faster convergence by reducing the variance of the value function updates.
C.It allows the agent to learn about the optimal policy while behaving according to an exploratory (sub-optimal) policy.
D.It is a model-based approach, which is more sample-efficient than the model-free on-policy methods.
Correct Answer: It allows the agent to learn about the optimal policy while behaving according to an exploratory (sub-optimal) policy.
Explanation:
The core strength of off-policy learning is the decoupling of the target policy (the policy we want to learn) from the behavior policy (the policy used to generate experience). Q-Learning's update rule uses the max Q-value for the next state, effectively learning about the greedy (optimal) policy, regardless of which action was actually taken by the exploratory behavior policy. SARSA, an on-policy algorithm, updates its Q-values based on the action actually taken, thus learning the value of its current behavior policy (including its exploration steps). This decoupling allows off-policy methods to learn from historical data or from a human expert's actions, which is not possible with on-policy methods.
Incorrect! Try again.
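The decoupling shows up directly in the two update rules. This is a schematic sketch; the Q-table and transition values are hypothetical.

```python
def q_learning_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.9):
    """Off-policy: the target uses the max over next actions (the greedy
    target policy), regardless of what the behavior policy will do."""
    target = r + gamma * max(Q[s_next].values())
    Q[s][a] += alpha * (target - Q[s][a])

def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.9):
    """On-policy: the target uses the action a_next actually chosen by the
    (exploratory) behavior policy."""
    target = r + gamma * Q[s_next][a_next]
    Q[s][a] += alpha * (target - Q[s][a])

# In s1, 'left' looks good (value 1.0) but the agent explores 'right' (-1.0).
Q = {"s0": {"left": 0.0, "right": 0.0}, "s1": {"left": 1.0, "right": -1.0}}
q_learning_update(Q, "s0", "left", 0.0, "s1")        # bootstraps from max = 1.0
Qs = {"s0": {"left": 0.0, "right": 0.0}, "s1": {"left": 1.0, "right": -1.0}}
sarsa_update(Qs, "s0", "left", 0.0, "s1", "right")   # bootstraps from -1.0
print(round(Q["s0"]["left"], 2), round(Qs["s0"]["left"], 2))  # 0.09 vs -0.09
```

Given identical experience, Q-Learning's estimate moves toward the greedy policy's value while SARSA's moves toward the value of the exploratory behavior actually taken.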
49A dataset has two binary features, $X_1$ and $X_2$, and a binary class $Y$. You observe that $P(Y{=}1 \mid X_1{=}1) = 0.6$ and $P(Y{=}1 \mid X_2{=}1) = 0.6$. If you also know that $X_1$ and $X_2$ are conditionally independent given $Y$, can $P(Y{=}1 \mid X_1{=}1, X_2{=}1)$ be lower than 0.6? Why or why not?
Probability
Hard
A.Yes, this can happen if $P(X_1{=}1 \mid Y{=}0)$ and $P(X_2{=}1 \mid Y{=}0)$ are both very high, causing the evidence for $Y{=}0$ to accumulate faster.
B.Yes, but only if the features $X_1$ and $X_2$ are negatively correlated.
C.No, the conditional independence assumption implies that $P(Y{=}1 \mid X_1{=}1, X_2{=}1) \geq \max\big(P(Y{=}1 \mid X_1{=}1),\, P(Y{=}1 \mid X_2{=}1)\big)$, which would be at least 0.6.
D.No, because the two features independently provide evidence for Y=1, their combined evidence must be stronger than either alone.
Correct Answer: Yes, this can happen if $P(X_1{=}1 \mid Y{=}0)$ and $P(X_2{=}1 \mid Y{=}0)$ are both very high, causing the evidence for $Y{=}0$ to accumulate faster.
Explanation:
This is a non-intuitive result related to base rates and evidence accumulation, reminiscent of Simpson's paradox. The posterior $P(Y{=}1 \mid X_1{=}1, X_2{=}1)$ depends on the likelihoods of seeing $X_1{=}1$ and $X_2{=}1$ under both classes, $P(X_i{=}1 \mid Y{=}1)$ and $P(X_i{=}1 \mid Y{=}0)$. While each feature individually points toward $Y{=}1$ (posterior 0.6), it is possible that the combination $X_1{=}1, X_2{=}1$ is extremely common when $Y{=}0$. If $P(X_1{=}1 \mid Y{=}0)$ and $P(X_2{=}1 \mid Y{=}0)$ are both very high (e.g., 0.9), their product under the independence assumption makes the likelihood $P(X_1{=}1, X_2{=}1 \mid Y{=}0)$ very high. This strong evidence for $Y{=}0$ can potentially outweigh the evidence for $Y{=}1$, pushing the posterior probability down, possibly even below 0.6.
Incorrect! Try again.
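A concrete set of numbers makes the effect visible. These values are illustrative assumptions, chosen so each feature alone yields a posterior of exactly 0.6: prior $P(Y{=}1)=0.8$, $P(X_i{=}1 \mid Y{=}1)=0.3375$, and $P(X_i{=}1 \mid Y{=}0)=0.9$.

```python
prior = 0.8                             # hypothetical P(Y=1)
lik1 = {"y1": 0.3375, "y0": 0.9}        # hypothetical P(X1=1 | Y)
lik2 = {"y1": 0.3375, "y0": 0.9}        # hypothetical P(X2=1 | Y)

def posterior(likelihoods):
    """P(Y=1 | observed evidence) under the conditional-independence assumption."""
    num = prior
    den = 1 - prior
    for lik in likelihoods:
        num *= lik["y1"]
        den *= lik["y0"]
    return num / (num + den)

print(round(posterior([lik1]), 3))        # 0.6   -- single feature
print(round(posterior([lik1, lik2]), 3))  # 0.36  -- both features: below 0.6
```

Each feature is far more likely under $Y{=}0$ (0.9 vs 0.3375); only the strong prior keeps the single-feature posterior at 0.6, and seeing both features lets the $Y{=}0$ likelihoods compound.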
50You are comparing two complex models (e.g., a deep neural network and a gradient boosting machine) using 5x2 cross-validation and a paired t-test on the results to claim statistical significance. According to Dietterich (1998), why might this statistical test have a higher than desired Type I error rate in this specific machine learning context?
Statistics
Hard
A.The underlying distribution of model accuracies is often not Gaussian, which is a core assumption of the t-test.
B.The training sets in cross-validation are highly overlapping, which violates the independence assumption of the t-test, leading to an underestimation of the true variance.
C.A paired t-test cannot be used to compare two different algorithms; it can only compare one algorithm with different hyperparameters.
D.The 5x2 CV procedure systematically biases the performance estimate in favor of the more complex model.
Correct Answer: The training sets in cross-validation are highly overlapping, which violates the independence assumption of the t-test, leading to an underestimation of the true variance.
Explanation:
The standard t-test assumes that the measurements (the performance differences per fold) are independent. However, in k-fold cross-validation, the training sets for any two folds overlap by a large amount. This means the models produced are not independent, and their performance scores are correlated. This correlation leads to an underestimation of the variance of the average difference. A smaller estimated variance makes the t-statistic larger, increasing the likelihood of rejecting the null hypothesis when it is true (a Type I error). Dietterich's seminal paper showed that this issue is particularly pronounced for standard k-fold CV and proposed the 5x2 CV t-test as a partial remedy, though the core issue of dependence remains a concern in ML model comparison.
Incorrect! Try again.
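As a sketch of the remedy the question mentions, Dietterich's 5x2cv paired t statistic can be computed as follows. The `diffs` values (per-fold accuracy differences between the two models) are hypothetical.

```python
import math

# diffs[i] = (p_i^(1), p_i^(2)): accuracy differences between the two models
# in the two folds of replication i of 2-fold CV (hypothetical example values).
diffs = [(0.02, 0.04), (0.03, 0.01), (0.05, 0.02), (0.01, 0.03), (0.04, 0.02)]

def five_by_two_cv_t(diffs):
    """Dietterich (1998) 5x2cv paired t statistic, ~ t with 5 df under H0."""
    variances = []
    for p1, p2 in diffs:
        mean = (p1 + p2) / 2
        variances.append((p1 - mean) ** 2 + (p2 - mean) ** 2)
    # Numerator uses the difference from the first fold of the first replication.
    return diffs[0][0] / math.sqrt(sum(variances) / 5)

print(round(five_by_two_cv_t(diffs), 3))
```

The statistic is compared against a t distribution with 5 degrees of freedom; the replication-level variance estimate is what partially compensates for the correlation between folds.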
51You apply both K-Means and a Gaussian Mixture Model (GMM) with 3 components to a dataset. You find that K-Means identifies three well-separated, spherical clusters, while the GMM identifies three overlapping, elliptical clusters. Which statement is the most valid conclusion?
Supervised, unsupervised, and reinforcement learning: concepts and real-world use
Hard
A.K-Means is the correct choice because it produced well-separated clusters, indicating the GMM is overfitting the data by creating complex shapes.
B.The results are equivalent, as GMM is just a probabilistic version of K-Means and will always converge to a similar result.
C.Both algorithms have failed; K-Means is too simplistic and GMM is too complex, so a density-based algorithm like DBSCAN should be used.
D.GMM provides a more nuanced result by modeling cluster covariance and providing probabilistic assignments, which is likely superior if the true clusters are not perfectly spherical and separated.
Correct Answer: GMM provides a more nuanced result by modeling cluster covariance and providing probabilistic assignments, which is likely superior if the true clusters are not perfectly spherical and separated.
Explanation:
K-Means is a hard-assignment algorithm that assumes clusters are isotropic (spherical) and of similar size. It can only find linear decision boundaries between clusters. GMM is a soft-assignment algorithm that generalizes K-Means. It can model non-spherical (elliptical) clusters by estimating a full covariance matrix for each component. The fact that GMM found overlapping, elliptical clusters suggests this is a more accurate representation of the underlying data structure than the simplistic, spherical assumption of K-Means. GMM's probabilistic assignments also provide a measure of uncertainty for points in the overlapping regions.
Incorrect! Try again.
52You are building a model to predict house prices and have a 'zip_code' categorical feature with over 1000 unique values. Why is one-hot encoding this feature for a linear regression model often a poor choice, and what is a more effective (though potentially risky) alternative?
Feature engineering and model evaluation (cross-validation, precision, recall)
Hard
A.Poor choice: It cannot be used in a linear model, only in tree-based models. Alternative: Deleting the feature from the dataset entirely.
B.Poor choice: It introduces perfect multicollinearity into the feature matrix. Alternative: Using ordinal encoding by sorting zip codes numerically.
C.Poor choice: It violates the independence assumption of linear regression. Alternative: Hashing the feature into a smaller number of dimensions.
D.Poor choice: It creates a very high-dimensional, sparse feature space (curse of dimensionality) which can hurt model performance and interpretability. Alternative: Target encoding, where each zip code is replaced by the average house price within that zip code.
Correct Answer: Poor choice: It creates a very high-dimensional, sparse feature space (curse of dimensionality) which can hurt model performance and interpretability. Alternative: Target encoding, where each zip code is replaced by the average house price within that zip code.
Explanation:
One-hot encoding a high-cardinality categorical feature like 'zip_code' leads to the curse of dimensionality, creating thousands of new binary features. This sparsity makes it difficult for a linear model to learn robust weights, especially for zip codes with few samples. A powerful alternative is target encoding (or mean encoding). This replaces the categorical feature with a single numerical feature representing the average target value for that category. This directly encodes information about the target variable, making it highly predictive. The risk is data leakage and overfitting if not implemented carefully (e.g., by calculating the means on the training set only and applying them to the validation/test sets, or by using a cross-validation scheme).
Incorrect! Try again.
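A leakage-aware sketch of target encoding in pandas (column names and values are hypothetical): the encoding map is fitted on the training split only, and unseen categories fall back to the global mean.

```python
import pandas as pd

train = pd.DataFrame({
    "zip_code": ["94110", "94110", "10001", "10001", "60601"],
    "price":    [900_000, 850_000, 1_200_000, 1_100_000, 400_000],
})
test = pd.DataFrame({"zip_code": ["94110", "60601", "73301"]})  # 73301 unseen

# Fit the encoding on the training data only to avoid target leakage.
zip_means = train.groupby("zip_code")["price"].mean()
global_mean = train["price"].mean()

# Apply to test; categories not seen in training get the global mean.
test["zip_encoded"] = test["zip_code"].map(zip_means).fillna(global_mean)
print(test)
```

For a more robust version, the per-category means would be computed out-of-fold (or smoothed toward the global mean for rare zip codes), as the explanation warns.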
53You are modeling a system with a Naive Bayes classifier. The 'naive' assumption is that all features are conditionally independent given the class, i.e., $P(X_1, \ldots, X_n \mid Y) = \prod_i P(X_i \mid Y)$. If two features, $X_i$ and $X_j$, are actually perfectly correlated, how does this violation of the assumption affect the posterior probability calculation for class Y?
Bayesian networks, and probabilistic reasoning
Hard
A.It will 'double-count' the evidence from the correlated features, leading to posterior probabilities that are unjustifiably extreme (pushed towards 0 or 1).
B.The violation has no effect on the rank-ordering of the posterior probabilities, so the final classification decision remains optimal.
C.It will cause the model to ignore one of the features, as the information is redundant, leading to a loss of signal.
D.The model's posterior probability calculation will fail due to a division by zero error caused by the linear dependency.
Correct Answer: It will 'double-count' the evidence from the correlated features, leading to posterior probabilities that are unjustifiably extreme (pushed towards 0 or 1).
Explanation:
The Naive Bayes classifier calculates the posterior as $P(Y \mid X_1, \ldots, X_n) \propto P(Y) \prod_i P(X_i \mid Y)$. If $X_i$ and $X_j$ provide the same information (perfect correlation), including both $P(X_i \mid Y)$ and $P(X_j \mid Y)$ in the product is like including the same piece of evidence twice. This squaring effect on the likelihood term will artificially amplify the evidence, pushing the calculated posterior probabilities towards 0 or 1 and making the model overly confident in its predictions.
Incorrect! Try again.
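The double-counting effect is easy to quantify with a toy sketch (the prior and likelihoods are hypothetical): duplicating a feature squares its likelihood contribution.

```python
p_y1 = 0.5                              # uniform prior over the two classes
p_x_given_y1, p_x_given_y0 = 0.8, 0.2   # hypothetical likelihoods of feature x

def nb_posterior(n_copies):
    """P(Y=1 | the same feature x counted n_copies times) under the naive assumption."""
    num = p_y1 * p_x_given_y1 ** n_copies
    den = num + (1 - p_y1) * p_x_given_y0 ** n_copies
    return num / den

print(round(nb_posterior(1), 3))   # 0.8   -- one feature
print(round(nb_posterior(2), 3))   # 0.941 -- same evidence twice: overconfident
```

The true posterior given a single piece of evidence is 0.8; treating a perfect copy as independent pushes it to about 0.94, exactly the unjustified extremity the explanation describes.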
54In a recommender system using truncated Singular Value Decomposition (SVD) on the user-item matrix $R$, what is the geometric interpretation of predicting a user's rating for an unseen item?
Linear algebra (applied focus)
Hard
A.It is equivalent to taking the dot product of the user's vector in the k-dimensional latent space (a row in $U_k\Sigma_k$) and the item's vector in the same space (a row in $V_k$).
B.It is the cosine similarity between the user's latent factor vector (from $U_k$) and the item's latent factor vector (from $V_k$).
C.It is calculated by projecting the user's vector onto the principal components of the item-item covariance matrix.
D.It involves finding the Euclidean distance between the user's vector and the item's vector in the original high-dimensional space.
Correct Answer: It is equivalent to taking the dot product of the user's vector in the k-dimensional latent space (a row in $U_k\Sigma_k$) and the item's vector in the same space (a row in $V_k$).
Explanation:
The reconstructed matrix $\hat{R} = U_k\Sigma_kV_k^T$ provides the predictions. An individual entry $\hat{r}_{ij}$ (rating of user i for item j) is the dot product of the i-th row of $U_k\Sigma_k$ and the j-th row of $V_k$ (which is the j-th column of $V_k^T$). Geometrically, this means we represent both users and items as vectors in a common k-dimensional latent space. A high rating is predicted when the user and item vectors are closely aligned and have large magnitudes in this space, as captured by the dot product.
Incorrect! Try again.
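A numpy sketch of the mechanics, with a toy hypothetical ratings matrix: factor it, keep k components, and read each prediction off a dot product in the shared latent space.

```python
import numpy as np

# Toy user-item rating matrix (4 users x 3 items), hypothetical values:
# two "taste groups" that prefer items 1-2 vs item 3.
R = np.array([[5.0, 4.0, 1.0],
              [4.0, 5.0, 1.0],
              [1.0, 1.0, 5.0],
              [1.0, 2.0, 4.0]])

U, s, Vt = np.linalg.svd(R, full_matrices=False)
k = 2
user_vecs = U[:, :k] * s[:k]    # rows: users embedded in the k-dim latent space
item_vecs = Vt[:k].T            # rows: items embedded in the same space

# Each predicted rating is the dot product of a user vector and an item vector.
R_hat = user_vecs @ item_vecs.T
print(R_hat.round(1))           # approximately reconstructs R with only k=2 factors
```

The two latent dimensions roughly capture the two taste groups, so the rank-2 reconstruction stays close to the original ratings.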
55You have trained a Support Vector Machine (SVM) with a Radial Basis Function (RBF) kernel. You observe that the model has very high accuracy on the training set but poor accuracy on the test set. Which hyperparameter adjustments are most likely to mitigate this overfitting?
Supervised, unsupervised, and reinforcement learning: concepts and real-world use
Hard
A.Increase the regularization parameter C and increase the kernel coefficient gamma.
B.Decrease the regularization parameter C and increase the kernel coefficient gamma.
C.Decrease the regularization parameter C and decrease the kernel coefficient gamma.
D.Increase the regularization parameter C and decrease the kernel coefficient gamma.
Correct Answer: Decrease the regularization parameter C and decrease the kernel coefficient gamma.
Explanation:
Overfitting in an RBF SVM indicates the decision boundary is too complex and sensitive to individual data points.
The C parameter is the regularization parameter. A high C value penalizes misclassifications heavily, leading to a complex, tight-fitting boundary. Decreasing C allows for a 'softer' margin, tolerating more misclassifications in the training set to achieve a simpler, more generalizable decision boundary.
The gamma parameter defines the influence of a single training example. A high gamma leads to a very localized influence, creating a complex, 'spiky' boundary. Decreasing gamma makes the influence of each support vector broader, resulting in a smoother, less complex boundary. Both actions serve to regularize the model and combat overfitting.
Incorrect! Try again.
56In a binary classification problem with severe class imbalance, where the positive class is rare but of high importance, why is the Area Under the ROC Curve (AUC-ROC) potentially a misleading metric compared to the Area Under the Precision-Recall Curve (AUC-PR)?
Feature engineering and model evaluation (cross-validation, precision, recall)
Hard
A.AUC-PR is insensitive to the decision threshold, whereas AUC-ROC is highly sensitive, making AUC-PR more robust.
B.AUC-ROC assumes that the costs of false positives and false negatives are equal.
C.AUC-ROC includes True Negatives in its calculation (via the False Positive Rate), and in an imbalanced setting, a model can achieve a high score by simply correctly identifying the overwhelmingly large number of true negatives.
D.AUC-ROC is only applicable to linear models, while AUC-PR can be used for any classifier.
Correct Answer: AUC-ROC includes True Negatives in its calculation (via the False Positive Rate), and in an imbalanced setting, a model can achieve a high score by simply correctly identifying the overwhelmingly large number of true negatives.
Explanation:
The ROC curve plots True Positive Rate (TPR) vs. False Positive Rate (FPR). FPR is calculated as $\frac{FP}{FP + TN}$. In a highly imbalanced dataset, the number of True Negatives (TN) is massive. A model can generate a large number of False Positives (FP) without making the FPR significantly large, resulting in a deceptively optimistic AUC-ROC score. The Precision-Recall curve, however, plots Precision ($\frac{TP}{TP + FP}$) vs. Recall (TPR). It does not use TN in its calculation. It focuses directly on the model's performance on the minority (positive) class, making it a much more informative metric for tasks like fraud or disease detection.
Incorrect! Try again.
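A quick numeric illustration, using a hypothetical confusion matrix at one threshold on a 1-in-1000 imbalanced problem (1,000 positives vs 999,000 negatives):

```python
# Hypothetical counts at a single decision threshold.
tp, fn = 900, 100          # 1,000 positives total
fp, tn = 9_990, 989_010    # 999,000 negatives total

fpr = fp / (fp + tn)            # drives ROC: tiny, thanks to the huge TN count
recall = tp / (tp + fn)         # true positive rate
precision = tp / (tp + fp)      # drives PR: exposes the flood of false alarms

print(f"FPR = {fpr:.3f}")             # 0.010 -- ROC point looks excellent
print(f"Recall = {recall:.3f}")       # 0.900
print(f"Precision = {precision:.3f}") # 0.083 -- only ~8% of alarms are real
```

A TPR of 0.9 at an FPR of 0.01 plots as a near-perfect ROC point, yet fewer than one in ten flagged cases is actually positive — which the PR curve makes immediately visible.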
57In a Bayesian network, if a directed path from node X to node Z exists, such as $X \rightarrow Y \rightarrow Z$, what is the effect of conditioning on node Y?
Bayesian networks, and probabilistic reasoning
Hard
A.It has no effect on the relationship between X and Z.
B.It makes X and Z conditionally independent.
C.It reverses the direction of influence from Z to X.
D.It makes X and Z conditionally dependent.
Correct Answer: It makes X and Z conditionally independent.
Explanation:
This structure is a 'chain'. According to the rules of d-separation, a path is blocked if it contains a chain ($A \rightarrow B \rightarrow C$) where the middle node (B) is conditioned on. Intuitively, all information from X that influences Z must pass through Y. Once the state of Y is known, X provides no additional information about Z. Therefore, conditioning on Y renders X and Z conditionally independent.
Incorrect! Try again.
58You are building a linear regression model and suspect multicollinearity. You observe that the Variance Inflation Factors (VIFs) for several predictors are very high (>10). How does Ridge Regression (L2 regularization) specifically address the mathematical instability caused by this issue?
Statistics
Hard
A.By performing feature selection and setting the coefficients of correlated predictors to exactly zero.
B.By adding a positive value ($\lambda I$) to the diagonal of the $X^TX$ matrix before inversion, making the matrix non-singular and stable even with correlated features.
C.By using a robust loss function that is less sensitive to the large coefficients that result from multicollinearity.
D.By transforming the correlated features into a new set of orthogonal features using PCA before fitting the model.
Correct Answer: By adding a positive value ($\lambda I$) to the diagonal of the $X^TX$ matrix before inversion, making the matrix non-singular and stable even with correlated features.
Explanation:
The solution for Ridge Regression is $\theta = (X^TX + \lambda I)^{-1}X^Ty$. Multicollinearity causes the matrix $X^TX$ to be nearly singular (ill-conditioned), meaning some of its eigenvalues are close to zero. The inversion of such a matrix is numerically unstable. By adding $\lambda I$ (a positive constant times the identity matrix), we are effectively adding $\lambda$ to each eigenvalue of $X^TX$. This shifts all eigenvalues away from zero, guaranteeing that the matrix is invertible and well-conditioned, thus stabilizing the solution. This process shrinks the resulting coefficients, reducing their variance.
Incorrect! Try again.
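The eigenvalue-shift argument can be verified numerically. This sketch builds a deliberately collinear design matrix from hypothetical data and inspects the Gram matrix's spectrum before and after adding $\lambda I$:

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=(100, 1))
# Second column is a near-duplicate of the first: near-perfect multicollinearity.
X = np.hstack([x, x + 1e-8 * rng.normal(size=(100, 1))])

gram = X.T @ X
eig = np.linalg.eigvalsh(gram)
print(eig)                      # one eigenvalue is essentially zero -> ill-conditioned

lam = 1.0
eig_ridge = np.linalg.eigvalsh(gram + lam * np.eye(2))
print(eig_ridge)                # every eigenvalue shifted up by exactly lambda
```

With the small eigenvalue lifted to roughly $\lambda$, the ridge system $(X^TX + \lambda I)\theta = X^Ty$ is well-conditioned and its solution stable, matching the explanation above.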
59In the context of deep learning, what is the critical difference between semi-supervised learning and transfer learning?
Supervised, unsupervised, and reinforcement learning: concepts and real-world use
Hard
A.Transfer learning is used for reinforcement learning problems, while semi-supervised learning is used for classification problems.
B.Semi-supervised learning requires at least two different models to be trained, whereas transfer learning only requires fine-tuning one model.
C.Semi-supervised learning uses a large amount of unlabeled data and a small amount of labeled data from the same task to improve performance, while transfer learning uses knowledge (e.g., weights) from a different, pre-trained task to bootstrap learning on a new task.
D.Transfer learning is a form of unsupervised learning, while semi-supervised learning is a form of supervised learning.
Correct Answer: Semi-supervised learning uses a large amount of unlabeled data and a small amount of labeled data from the same task to improve performance, while transfer learning uses knowledge (e.g., weights) from a different, pre-trained task to bootstrap learning on a new task.
Explanation:
The key distinction is the source and purpose of the 'extra' data/knowledge. Semi-supervised learning leverages unlabeled data from the target domain to better understand its underlying structure, helping the model generalize from the few labeled examples it has for that same domain. Transfer learning leverages a model (and its learned features) trained on a different, often much larger, dataset (e.g., ImageNet) and applies it to a new target task (e.g., medical image classification), which may have limited labeled data. It's about transferring knowledge across tasks, not leveraging unlabeled data for the same task.
Incorrect! Try again.
60You are building a time-series forecasting model to predict next month's sales based on the previous 12 months. You decide to use 5-fold cross-validation by randomly shuffling and splitting your 5 years of monthly data. Why is this a critically flawed evaluation strategy?
Feature engineering and model evaluation (cross-validation, precision, recall)
Hard
A.It reduces the amount of training data available in each fold, leading to a high-bias model.
B.Random shuffling is computationally inefficient for time-series data compared to a simple chronological split.
C.It violates the i.i.d. (independent and identically distributed) assumption, which is a necessary condition for all machine learning models.
D.It causes severe data leakage, as the model will be trained on data from the future to make 'predictions' about the past, leading to an unrealistically optimistic performance estimate.
Correct Answer: It causes severe data leakage, as the model will be trained on data from the future to make 'predictions' about the past, leading to an unrealistically optimistic performance estimate.
Explanation:
The most critical flaw is data leakage. Time-series data has an inherent temporal order. By randomly shuffling, a fold's training set could contain data from 2022, while its validation set contains data from 2021. The model learns from information that would not have been available at the time of the 'prediction', a situation that is impossible in a real-world deployment. This leakage leads to performance metrics that are artificially inflated and do not reflect the model's true ability to forecast the future. The correct approach is to use a method that respects temporal order, such as walk-forward validation or time-series split.
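In code, the fix is to split chronologically so every validation fold lies strictly after its training data. A minimal walk-forward (expanding-window) sketch, with illustrative month counts:

```python
def walk_forward_splits(n_samples, n_folds, min_train):
    """Yield (train_idx, val_idx) pairs where validation always follows training."""
    fold_size = (n_samples - min_train) // n_folds
    for i in range(n_folds):
        train_end = min_train + i * fold_size
        val_end = train_end + fold_size
        yield list(range(train_end)), list(range(train_end, val_end))

# 60 months of data, 4 folds, always train on at least the first 24 months.
for train, val in walk_forward_splits(60, 4, 24):
    print(f"train: months 0-{train[-1]:>2}, validate: months {val[0]}-{val[-1]}")
```

Because each fold's training window ends before its validation window begins, the model never sees "future" observations, and the performance estimate reflects genuine forecasting ability.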