Unit 3 - Practice Quiz

INT428 60 Questions

1 Which type of machine learning uses labeled data, where each data point is tagged with a correct output, to train a model?

Supervised, unsupervised, and reinforcement learning: concepts and real-world use Easy
A. Reinforcement Learning
B. Semi-supervised Learning
C. Supervised Learning
D. Unsupervised Learning

2 What measure of central tendency represents the most frequently occurring value in a dataset?

Statistics Easy
A. Range
B. Mode
C. Median
D. Mean

3 If you roll a single fair six-sided die, what is the probability of rolling an even number (2, 4, or 6)?

Probability Easy
A. 1/2
B. 2/3
C. 1/3
D. 1/6
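
The answer can be checked by brute-force enumeration of the sample space; a minimal Python sketch:

```python
# Enumerate the six equally likely faces of a fair die and count the even ones.
faces = range(1, 7)
p_even = sum(1 for f in faces if f % 2 == 0) / 6
# 3 of the 6 faces (2, 4, 6) are even, so p_even is 1/2
```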

4 In machine learning, a collection of numbers representing a single data point (e.g., height, weight, and age of a person) is typically stored as a:

Linear algebra (applied focus) Easy
A. Scalar
B. Matrix
C. Vector
D. Tensor

5 What is the primary purpose of cross-validation in model evaluation?

Feature engineering and model evaluation (cross-validation, precision, recall) Easy
A. To always increase the model's accuracy on the training data
B. To make the model train faster
C. To assess how a model will generalize to an independent, unseen dataset
D. To simplify the features of the data

6 Grouping customers into different segments based on their purchasing habits, without any predefined labels for the groups, is a classic example of:

Supervised, unsupervised, and reinforcement learning: concepts and real-world use Easy
A. Reinforcement Learning
B. Unsupervised Learning
C. Regression
D. Supervised Learning

7 Bayes' Theorem is used to update the probability of a hypothesis based on new evidence. What is the term for this updated probability?

Bayes theorem Easy
A. Marginal probability
B. Prior probability
C. Posterior probability
D. Likelihood

8 A 2-dimensional grid of numbers, often used to represent an entire dataset where rows are data points and columns are features, is called a:

Linear algebra (applied focus) Easy
A. Matrix
B. Vector
C. Scalar
D. Diagonal

9 In the context of a binary classification model, what does 'Precision' measure?

Feature engineering and model evaluation (cross-validation, precision, recall) Easy
A. The proportion of actual positive instances that were correctly identified
B. The proportion of positive predictions that were actually correct
C. The model's ability to correctly identify negative instances
D. The overall accuracy of the model across all classes

10 A self-driving car AI learning to navigate by receiving a 'reward' for a correct action and a 'penalty' for a mistake is using which type of machine learning?

Supervised, unsupervised, and reinforcement learning: concepts and real-world use Easy
A. Supervised Learning
B. Dimensionality Reduction
C. Reinforcement Learning
D. Unsupervised Learning

11 What is the 'median' of the following set of numbers: [1, 7, 3, 9, 5]?

Statistics Easy
A. 7
B. 3
C. 5
D. 25
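
A worked check using Python's standard library (statistics.median sorts the values internally, so the unsorted input is fine):

```python
import statistics

# Sorted, the values are [1, 3, 5, 7, 9]; the middle element is the median.
scores = [1, 7, 3, 9, 5]
med = statistics.median(scores)
```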

12 What does a node in a Bayesian Network typically represent?

Bayesian networks, and probabilistic reasoning Easy
A. A deterministic value
B. An entire dataset
C. A machine learning algorithm
D. A random variable

13 The probability of any event is always a number between:

Probability Easy
A. 0 and 100 (inclusive)
B. 0 and 1 (inclusive)
C. 1 and infinity
D. -1 and 1 (inclusive)

14 The process of selecting, transforming, or creating the most suitable input variables for a machine learning model is called:

Feature engineering and model evaluation (cross-validation, precision, recall) Easy
A. Cross-validation
B. Algorithm selection
C. Feature engineering
D. Model evaluation

15 Predicting the exact price of a house based on its size, location, and number of bedrooms is an example of what kind of problem?

Supervised, unsupervised, and reinforcement learning: concepts and real-world use Easy
A. Regression
B. Association
C. Clustering
D. Classification

16 What is a scalar in the context of linear algebra?

Linear algebra (applied focus) Easy
A. A type of model
B. An array of numbers
C. A single number
D. A grid of numbers

17 Email spam detection, where an algorithm is trained on emails already labeled as 'spam' or 'not spam', is a task known as:

Supervised, unsupervised, and reinforcement learning: concepts and real-world use Easy
A. Classification
B. Clustering
C. Regression
D. Reinforcement Learning

18 In a dataset of student test scores, the difference between the highest and lowest score is known as the:

Statistics Easy
A. Mean
B. Standard Deviation
C. Variance
D. Range

19 In a Bayesian Network, what does a directed edge (an arrow) from Node A to Node B signify?

Bayesian networks, and probabilistic reasoning Easy
A. Node A and Node B have the same probability distribution
B. The state of Node A directly influences the probability of the state of Node B
C. The state of Node B directly influences the probability of the state of Node A
D. Node A and Node B are completely independent

20 A 'False Positive' in a medical test designed to detect a disease means:

Feature engineering and model evaluation (cross-validation, precision, recall) Easy
A. The test incorrectly indicates a sick person is healthy
B. The test correctly indicates a sick person has the disease
C. The test incorrectly indicates a healthy person has the disease
D. The test correctly indicates a healthy person is healthy

21 A model designed to detect a rare but critical disease has a high precision of 95% but a very low recall of 10%. What is the most accurate interpretation of this result?

Feature engineering and model evaluation (cross-validation, precision, recall) Medium
A. The model is highly reliable when it predicts a patient has the disease, but it misses most of the actual positive cases.
B. The model correctly identifies most of the patients who have the disease.
C. The model has a high overall accuracy and is performing well.
D. The model incorrectly flags many healthy patients as having the disease.
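
Hypothetical confusion-matrix counts (chosen here purely for illustration) that reproduce the stated 95% precision and 10% recall:

```python
# Suppose 950 patients actually have the disease and the model flags only 100.
tp = 95   # sick patients correctly flagged
fp = 5    # healthy patients incorrectly flagged
fn = 855  # sick patients the model missed

precision = tp / (tp + fp)  # 0.95: when the model flags someone, it is usually right
recall = tp / (tp + fn)     # 0.10: but it misses 90% of the actual cases
```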

22 An e-commerce company wants to group its customers into distinct segments based on purchasing behavior (e.g., frequency, items bought, total spending) without any predefined labels for these segments. Which type of machine learning is most suitable for this task?

Supervised, unsupervised, and reinforcement learning: concepts and real-world use Medium
A. Semi-Supervised Learning
B. Supervised Learning
C. Reinforcement Learning
D. Unsupervised Learning

23 In Principal Component Analysis (PCA), the eigenvectors of the data's covariance matrix represent the...

Linear algebra (applied focus) Medium
A. variance of each principal component.
B. average value of each feature.
C. number of clusters in the data.
D. directions of maximum variance in the data.

24 A medical test for a disease has a 99% accuracy rate (it's correct 99% of the time). The disease has a prevalence of 1 in 10,000 people. If a randomly selected person tests positive, what can you conclude about the probability that they actually have the disease?

Bayes theorem, Bayesian networks, and probabilistic reasoning Medium
A. The probability is actually quite low (much less than 50%).
B. It is impossible to determine without knowing the false negative rate.
C. The probability is very high, but slightly less than 99%.
D. The probability is 99%.
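
A quick Bayes' theorem calculation, assuming "99% accuracy" means both the true positive rate and the true negative rate are 0.99 (an assumption, since the question does not separate the two):

```python
prevalence = 1 / 10_000
sensitivity = 0.99     # P(positive | disease)
false_pos_rate = 0.01  # P(positive | no disease)

p_pos = sensitivity * prevalence + false_pos_rate * (1 - prevalence)
p_disease_given_pos = sensitivity * prevalence / p_pos
# Roughly 0.0098: false positives from the huge healthy population
# swamp the rare true positives, so a positive test implies under a
# 1% chance of actually having the disease.
```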

25 A dataset of employee salaries at a tech company is heavily right-skewed due to a few extremely high executive salaries. Which measure of central tendency would provide the most realistic representation of a 'typical' employee's salary?

Statistics Medium
A. Mean
B. Mode
C. Standard Deviation
D. Median
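
A small demonstration with hypothetical salary figures, showing how one outlier drags the mean but not the median:

```python
import statistics

# Nine typical salaries plus one executive outlier (in thousands of dollars).
salaries = [60, 62, 65, 66, 70, 71, 73, 75, 80, 1000]
mean = statistics.mean(salaries)      # pulled far upward by the outlier
median = statistics.median(salaries)  # stays near the typical employee
```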

26 A program learns to play chess by making moves and receiving a reward of +1 for a win, -1 for a loss, and 0 for a draw after each game. The program's goal is to maximize its cumulative reward over many games. This scenario is a prime example of:

Supervised, unsupervised, and reinforcement learning: concepts and real-world use Medium
A. Clustering (Unsupervised Learning)
B. Reinforcement Learning
C. Classification (Supervised Learning)
D. Regression (Supervised Learning)

27 To evaluate a model's performance and generalizability while tuning its hyperparameters, the standard practice is to split the data into three sets. What is the primary purpose of the 'validation' set?

Feature engineering and model evaluation (cross-validation, precision, recall) Medium
A. To provide a final, unbiased evaluation of the model's performance on unseen data.
B. To increase the amount of data available for training.
C. To train the final model after hyperparameters have been chosen.
D. To select the best model hyperparameters without 'leaking' information from the test set.

28 In natural language processing, words are often represented as high-dimensional vectors (word embeddings). The cosine similarity between two word vectors measures:

Linear algebra (applied focus) Medium
A. The number of characters the words have in common.
B. The difference in word frequency.
C. The semantic similarity or relatedness of the words.
D. The Euclidean distance between the words in the vector space.
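
A minimal sketch with toy 3-dimensional "embeddings" (real word vectors have hundreds of dimensions, and these particular numbers are invented for illustration):

```python
import math

def cosine_similarity(u, v):
    # Dot product of the vectors divided by the product of their lengths:
    # measures the angle between them, ignoring their magnitudes.
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

king = [0.9, 0.8, 0.1]
queen = [0.85, 0.82, 0.15]
car = [0.1, 0.2, 0.95]

sim_related = cosine_similarity(king, queen)    # near 1: similar direction
sim_unrelated = cosine_similarity(king, car)    # much lower
```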

29 In a classification problem, the output of a logistic regression model for a given input is 0.7. What does this value represent?

Probability Medium
A. The probability that the input belongs to the positive class.
B. The predicted class label is 0.7.
C. The margin of separation from the decision boundary.
D. The accuracy of the model on this specific input.

30 In a simple Bayesian Network representing a medical diagnosis, we have the structure: Disease -> Symptom. If we know a patient has the Disease, what does this tell us about the probability of them having the Symptom?

Bayes theorem, Bayesian networks, and probabilistic reasoning Medium
A. The probability of the Symptom is now conditioned on the presence of the Disease, and is given by P(Symptom | Disease).
B. Knowing the patient has the Disease makes the Symptom certain to occur.
C. The probability of the Symptom becomes 0.
D. Knowing the patient has the Disease does not change the probability of the Symptom.

31 What is the primary motivation for standardizing features (e.g., using Z-score normalization) before applying distance-based algorithms like K-Nearest Neighbors (KNN)?

Statistics Medium
A. To reduce the number of features in the dataset.
B. To convert all features into a [0, 1] range.
C. To prevent features with larger scales from dominating the distance calculations.
D. To make the data conform to a normal distribution.
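
A sketch of Z-score normalization on two features with wildly different scales (the numbers are invented for illustration):

```python
import statistics

# Income in dollars dwarfs age in years, so a raw Euclidean distance in KNN
# would be dominated almost entirely by income differences.
ages = [25, 40, 60]
incomes = [30_000, 90_000, 45_000]

def zscores(xs):
    mu = statistics.mean(xs)
    sigma = statistics.pstdev(xs)
    return [(x - mu) / sigma for x in xs]

ages_z = zscores(ages)
incomes_z = zscores(incomes)
# After standardization both features have mean 0 and unit variance,
# so each contributes comparably to the distance calculation.
```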

32 If you multiply a 2D vector by the matrix , what geometric transformation is applied to the vector?

Linear algebra (applied focus) Medium
A. A scaling by a factor of 2.
B. A 90-degree counter-clockwise rotation.
C. A reflection across the y-axis.
D. A projection onto the x-axis.
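
A minimal sketch, assuming the matrix in question is the standard 90-degree counter-clockwise rotation matrix [[0, -1], [1, 0]] (an assumption; any of the listed transformations has its own canonical 2x2 matrix):

```python
# Assumed matrix: the standard 90-degree counter-clockwise rotation.
def apply(matrix, vec):
    (a, b), (c, d) = matrix
    x, y = vec
    return (a * x + b * y, c * x + d * y)

R = [[0, -1], [1, 0]]
rotated = apply(R, (1, 0))  # the unit x-vector maps to (0, 1): a 90° CCW turn
```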

33 You perform 5-fold cross-validation to evaluate your machine learning model. If your dataset has 1000 instances, how many instances are in the training set for each of the 5 iterations?

Feature engineering and model evaluation (cross-validation, precision, recall) Medium
A. 200
B. 1000
C. 800
D. 500

34 Which of the following problems is best framed as a regression task rather than a classification task?

Supervised, unsupervised, and reinforcement learning: concepts and real-world use Medium
A. Identifying the species of a flower from a photo.
B. Determining if an email is spam or not spam.
C. Predicting whether a customer will churn (yes/no).
D. Predicting the price of a house based on its features.

35 Two events A and B are mutually exclusive. If P(A) = 0.4 and P(B) = 0.3, what is the probability of A or B occurring, i.e., P(A ∪ B)?

Probability Medium
A. 0.12
B. 0.1
C. 0.7
D. 1.0

36 You are building a spam filter. Given that P(Spam) = 0.2, P(contains 'free' | Spam) = 0.8, and P(contains 'free' | Not Spam) = 0.05. Using Bayes' theorem, what are you trying to calculate?

Bayes theorem, Bayesian networks, and probabilistic reasoning Medium
A. P(contains 'free')
B. P(Spam | contains 'free')
C. P(Spam, contains 'free')
D. P(Not Spam)
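
The full calculation with the quantities given in the question, using the law of total probability for the denominator:

```python
p_spam = 0.2
p_free_given_spam = 0.8
p_free_given_ham = 0.05

# P(contains 'free') via the law of total probability.
p_free = p_free_given_spam * p_spam + p_free_given_ham * (1 - p_spam)

# Bayes' theorem: P(Spam | contains 'free').
p_spam_given_free = p_free_given_spam * p_spam / p_free
# 0.16 / (0.16 + 0.04) = 0.8: seeing 'free' raises P(Spam) from 0.2 to 0.8
```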

37 What is the primary risk of performing feature selection based on the model's performance on the final test set?

Feature engineering and model evaluation (cross-validation, precision, recall) Medium
A. It is computationally too expensive.
B. It can lead to a model that is too simple (underfitting).
C. It requires the data to be normally distributed.
D. It causes information from the test set to leak into the model selection process, leading to an over-optimistic performance estimate.

38 What does it imply if the determinant of a transformation matrix used in a machine learning model is zero?

Linear algebra (applied focus) Medium
A. The transformation is a pure rotation.
B. The transformation is an identity operation (no change).
C. The transformation scales the data uniformly.
D. The transformation collapses the data into a lower-dimensional space.

39 In a linear regression analysis, you create a plot of residuals versus fitted values and observe a distinct funnel shape (heteroscedasticity). What key assumption of linear regression does this violate?

Statistics Medium
A. Independence of errors.
B. Normality of errors.
C. Constant variance of errors (homoscedasticity).
D. Linearity of the relationship.

40 A hospital has a large dataset of patient records, where a small fraction of the records are labeled with a correct diagnosis by expert doctors, but the vast majority are unlabeled. The goal is to build a diagnostic model using all the available data. This problem is an example of:

Supervised, unsupervised, and reinforcement learning: concepts and real-world use Medium
A. Transfer Learning
B. Active Learning
C. Multi-task Learning
D. Semi-Supervised Learning

41 In the context of Principal Component Analysis (PCA), if the covariance matrix of your centered data is a non-identity diagonal matrix (i.e., diagonal entries are positive but not all equal to 1), what can you definitively conclude about the principal components?

Linear algebra (applied focus) Hard
A. The principal components are aligned with the original feature axes, and the transformation is essentially a scaling, not a rotation.
B. The covariance matrix is singular, and PCA cannot be computed.
C. The principal components will be at a 45-degree angle to the original axes, reflecting an average of the variances.
D. The data is perfectly correlated, and PCA will reduce its dimensionality to one.

42 You are developing a credit fraud detection model with a dataset where only 0.1% of transactions are fraudulent. After training, you achieve 99.9% accuracy. You then evaluate using the Area Under the Precision-Recall Curve (AUC-PR) and get a score of 0.2. What is the most accurate and nuanced interpretation of these results?

Feature engineering and model evaluation (cross-validation, precision, recall) Hard
A. The high accuracy is a misleading metric due to extreme class imbalance, and the AUC-PR of 0.2, while appearing low, is significantly better than a random baseline and indicates the model has some, albeit imperfect, skill.
B. The model is excellent because the accuracy is nearly perfect, and the low AUC-PR score must be an error in calculation or interpretation.
C. An AUC-PR of 0.2 is very poor for any dataset, indicating the model's predictions are no better than random guessing.
D. The model is severely overfitting, as indicated by the large discrepancy between the accuracy score and the AUC-PR score.
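
A back-of-the-envelope comparison against the random baseline (a sketch: for a classifier with random scores, precision at any recall approximately equals the positive-class prevalence, so the baseline AUC-PR here is roughly 0.001):

```python
# Random-baseline AUC-PR is approximately the positive-class prevalence.
prevalence = 0.001
model_auc_pr = 0.2
lift_over_random = model_auc_pr / prevalence  # ~200x better than chance
```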

43 In the Bayesian network defined by the structure A -> B <- C, which statement about the probabilistic relationship between nodes A and C is correct?

Bayesian networks, and probabilistic reasoning Hard
A. A and C are marginally dependent but become conditionally independent given B.
B. A and C are marginally independent but become conditionally dependent given B.
C. A and C are both marginally and conditionally independent of each other.
D. A and C are both marginally and conditionally dependent on each other.

44 A robotics company wants to train a bipedal robot to walk on varied and unseen terrain. The robot receives sensor data about its joint angles and orientation and a sparse positive reward only when it reaches a destination. It receives a large negative reward if it falls. Which specific class of algorithms within a broader ML paradigm is most appropriate for this task?

Supervised, unsupervised, and reinforcement learning: concepts and real-world use Hard
A. Model-based Reinforcement Learning to create a perfect simulation of the robot's dynamics.
B. Model-free, policy-based Reinforcement Learning (e.g., PPO, A3C)
C. Unsupervised Learning (e.g., autoencoders) to learn a low-dimensional representation of the terrain.
D. Supervised Learning with regression to predict optimal joint angles.

45 When solving a linear regression problem using the normal equation, theta = (X^T X)^(-1) X^T y, you find that the matrix X^T X is singular (non-invertible). What is the most likely data-related cause and the mathematical consequence?

Linear algebra (applied focus) Hard
A. Cause: The target variable y contains extreme outliers. Consequence: The matrix inverse cannot be computed.
B. Cause: Perfect multicollinearity among features. Consequence: The system has infinite solutions for the regression coefficients theta.
C. Cause: Features are not normalized. Consequence: The inverse operation is numerically unstable, but a unique solution technically exists.
D. Cause: The number of samples is far greater than the number of features. Consequence: The model will underfit the data.
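
A quick numerical check (minimal sketch with made-up data) that perfect multicollinearity makes the X^T X matrix rank-deficient, so its inverse does not exist:

```python
import numpy as np

# Build a design matrix whose third column is an exact multiple of the second.
x1 = np.array([1.0, 2.0, 3.0, 4.0])
X = np.column_stack([np.ones(4), x1, 2 * x1])  # perfect multicollinearity
gram = X.T @ X                                 # the X^T X matrix

rank = np.linalg.matrix_rank(gram)  # 2 rather than 3: gram is singular
```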

46 In a K-fold cross-validation setup, what is the primary statistical trade-off when increasing the value of K from 5 to N (where N is the total number of samples), a technique also known as Leave-One-Out Cross-Validation (LOOCV)?

Feature engineering and model evaluation (cross-validation, precision, recall) Hard
A. The computational cost decreases, but the bias of the estimate increases.
B. The bias of the performance estimate decreases, but its variance increases significantly.
C. Both the bias and variance of the performance estimate decrease, improving reliability.
D. The variance of the performance estimate decreases, but its bias increases.

47 A rare disease affects 1 in 10,000 people. A test for the disease is developed with a 99% true positive rate and a 98% true negative rate. If a person tests positive, what is the probability they actually have the disease? The key challenge here is the combination of a rare event and an imperfect test.

Bayes theorem Hard
A. Approximately 2%
B. Approximately 99%
C. Approximately 0.49%
D. Approximately 50%
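
Working through Bayes' theorem with the numbers from the question (a 98% true negative rate implies a 2% false positive rate):

```python
prevalence = 1 / 10_000
tpr = 0.99  # P(positive | disease)
fpr = 0.02  # P(positive | healthy) = 1 - 0.98

p_pos = tpr * prevalence + fpr * (1 - prevalence)
p_disease_given_pos = tpr * prevalence / p_pos
# About 0.0049 (0.49%): nearly every positive result comes from the 2% of
# the vast healthy population that falsely tests positive.
```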

48 What is the primary motivation for using an off-policy reinforcement learning algorithm like Q-Learning over an on-policy one like SARSA?

Supervised, unsupervised, and reinforcement learning: concepts and real-world use Hard
A. It avoids the need for a discount factor gamma, simplifying the Bellman equation.
B. It guarantees faster convergence by reducing the variance of the value function updates.
C. It allows the agent to learn about the optimal policy while behaving according to an exploratory (sub-optimal) policy.
D. It is a model-based approach, which is more sample-efficient than the model-free on-policy methods.

49 A dataset has two binary features, X1 and X2, and a binary class Y. You observe that P(Y=1 | X1=1) = 0.6 and P(Y=1 | X2=1) = 0.6. If you also know that X1 and X2 are conditionally independent given Y, can P(Y=1 | X1=1, X2=1) be lower than 0.6? Why or why not?

Probability Hard
A. Yes, this can happen if P(X1=1 | Y=0) and P(X2=1 | Y=0) are both very high, causing the evidence for Y=0 to accumulate faster.
B. Yes, but only if the features X1 and X2 are negatively correlated.
C. No, the conditional independence assumption implies that P(Y=1 | X1=1, X2=1) >= max(P(Y=1 | X1=1), P(Y=1 | X2=1)), which would be 0.6.
D. No, because the two features independently provide evidence for Y=1, their combined evidence must be stronger than either alone.

50 You are comparing two complex models (e.g., a deep neural network and a gradient boosting machine) using 5x2 cross-validation and a paired t-test on the results to claim statistical significance. According to Dietterich (1998), why might this statistical test have a higher than desired Type I error rate in this specific machine learning context?

Statistics Hard
A. The underlying distribution of model accuracies is often not Gaussian, which is a core assumption of the t-test.
B. The training sets in cross-validation are highly overlapping, which violates the independence assumption of the t-test, leading to an underestimation of the true variance.
C. A paired t-test cannot be used to compare two different algorithms; it can only compare one algorithm with different hyperparameters.
D. The 5x2 CV procedure systematically biases the performance estimate in favor of the more complex model.

51 You apply both K-Means and a Gaussian Mixture Model (GMM) with 3 components to a dataset. You find that K-Means identifies three well-separated, spherical clusters, while the GMM identifies three overlapping, elliptical clusters. Which statement is the most valid conclusion?

Supervised, unsupervised, and reinforcement learning: concepts and real-world use Hard
A. K-Means is the correct choice because it produced well-separated clusters, indicating the GMM is overfitting the data by creating complex shapes.
B. The results are equivalent, as GMM is just a probabilistic version of K-Means and will always converge to a similar result.
C. Both algorithms have failed; K-Means is too simplistic and GMM is too complex, so a density-based algorithm like DBSCAN should be used.
D. GMM provides a more nuanced result by modeling cluster covariance and providing probabilistic assignments, which is likely superior if the true clusters are not perfectly spherical and separated.

52 You are building a model to predict house prices and have a 'zip_code' categorical feature with over 1000 unique values. Why is one-hot encoding this feature for a linear regression model often a poor choice, and what is a more effective (though potentially risky) alternative?

Feature engineering and model evaluation (cross-validation, precision, recall) Hard
A. Poor choice: It cannot be used in a linear model, only in tree-based models. Alternative: Deleting the feature from the dataset entirely.
B. Poor choice: It introduces perfect multicollinearity into the feature matrix. Alternative: Using ordinal encoding by sorting zip codes numerically.
C. Poor choice: It violates the independence assumption of linear regression. Alternative: Hashing the feature into a smaller number of dimensions.
D. Poor choice: It creates a very high-dimensional, sparse feature space (curse of dimensionality) which can hurt model performance and interpretability. Alternative: Target encoding, where each zip code is replaced by the average house price within that zip code.

53 You are modeling a system with a Naive Bayes classifier. The 'naive' assumption is that all features are conditionally independent given the class, i.e., P(X1, ..., Xn | Y) = P(X1 | Y) * P(X2 | Y) * ... * P(Xn | Y). If two features, X1 and X2, are actually perfectly correlated, how does this violation of the assumption affect the posterior probability calculation for class Y?

Bayesian networks, and probabilistic reasoning Hard
A. It will 'double-count' the evidence from the correlated features, leading to posterior probabilities that are unjustifiably extreme (pushed towards 0 or 1).
B. The violation has no effect on the rank-ordering of the posterior probabilities, so the final classification decision remains optimal.
C. It will cause the model to ignore one of the features, as the information is redundant, leading to a loss of signal.
D. The model's posterior probability calculation will fail due to a division by zero error caused by the linear dependency.

54 In a recommender system using truncated Singular Value Decomposition (SVD) on the user-item matrix R ≈ U_k Σ_k V_k^T, what is the geometric interpretation of predicting a user's rating for an unseen item?

Linear algebra (applied focus) Hard
A. It is equivalent to taking the dot product of the user's vector in the k-dimensional latent space (a row in U_k Σ_k) and the item's vector in the same space (a row in V_k).
B. It is the cosine similarity between the user's latent factor vector (from U_k) and the item's latent factor vector (from V_k).
C. It is calculated by projecting the user's vector onto the principal components of the item-item covariance matrix.
D. It involves finding the Euclidean distance between the user's vector and the item's vector in the original high-dimensional space.

55 You have trained a Support Vector Machine (SVM) with a Radial Basis Function (RBF) kernel. You observe that the model has very high accuracy on the training set but poor accuracy on the test set. Which hyperparameter adjustments are most likely to mitigate this overfitting?

Supervised, unsupervised, and reinforcement learning: concepts and real-world use Hard
A. Increase the regularization parameter C and increase the kernel coefficient gamma.
B. Decrease the regularization parameter C and increase the kernel coefficient gamma.
C. Decrease the regularization parameter C and decrease the kernel coefficient gamma.
D. Increase the regularization parameter C and decrease the kernel coefficient gamma.

56 In a binary classification problem with severe class imbalance, where the positive class is rare but of high importance, why is the Area Under the ROC Curve (AUC-ROC) potentially a misleading metric compared to the Area Under the Precision-Recall Curve (AUC-PR)?

Feature engineering and model evaluation (cross-validation, precision, recall) Hard
A. AUC-PR is insensitive to the decision threshold, whereas AUC-ROC is highly sensitive, making AUC-PR more robust.
B. AUC-ROC assumes that the costs of false positives and false negatives are equal.
C. AUC-ROC includes True Negatives in its calculation (via the False Positive Rate), and in an imbalanced setting, a model can achieve a high score by simply correctly identifying the overwhelmingly large number of true negatives.
D. AUC-ROC is only applicable to linear models, while AUC-PR can be used for any classifier.

57 In a Bayesian network, if a directed path from node X to node Z exists, such as , what is the effect of conditioning on node Y?

Bayesian networks, and probabilistic reasoning Hard
A. It has no effect on the relationship between X and Z.
B. It makes X and Z conditionally independent.
C. It reverses the direction of influence from Z to X.
D. It makes X and Z conditionally dependent.

58 You are building a linear regression model and suspect multicollinearity. You observe that the Variance Inflation Factors (VIFs) for several predictors are very high (>10). How does Ridge Regression (L2 regularization) specifically address the mathematical instability caused by this issue?

Statistics Hard
A. By performing feature selection and setting the coefficients of correlated predictors to exactly zero.
B. By adding a positive value (lambda * I) to the diagonal of the X^T X matrix before inversion, making the matrix non-singular and stable even with correlated features.
C. By using a robust loss function that is less sensitive to the large coefficients that result from multicollinearity.
D. By transforming the correlated features into a new set of orthogonal features using PCA before fitting the model.
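
A numerical sketch (with made-up data) of why the ridge term helps: with perfectly correlated predictors the X^T X matrix is singular, but adding lambda times the identity to it restores full rank, so the ridge solution is well defined.

```python
import numpy as np

# Two perfectly correlated predictors: the second column is 2x the first.
x1 = np.array([1.0, 2.0, 3.0, 4.0])
X = np.column_stack([x1, 2 * x1])
gram = X.T @ X                       # singular: rank 1, not invertible

lam = 0.1
ridge_gram = gram + lam * np.eye(2)  # full rank: invertible

singular_rank = np.linalg.matrix_rank(gram)
ridge_rank = np.linalg.matrix_rank(ridge_gram)
```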

59 In the context of deep learning, what is the critical difference between semi-supervised learning and transfer learning?

Supervised, unsupervised, and reinforcement learning: concepts and real-world use Hard
A. Transfer learning is used for reinforcement learning problems, while semi-supervised learning is used for classification problems.
B. Semi-supervised learning requires at least two different models to be trained, whereas transfer learning only requires fine-tuning one model.
C. Semi-supervised learning uses a large amount of unlabeled data and a small amount of labeled data from the same task to improve performance, while transfer learning uses knowledge (e.g., weights) from a different, pre-trained task to bootstrap learning on a new task.
D. Transfer learning is a form of unsupervised learning, while semi-supervised learning is a form of supervised learning.

60 You are building a time-series forecasting model to predict next month's sales based on the previous 12 months. You decide to use 5-fold cross-validation by randomly shuffling and splitting your 5 years of monthly data. Why is this a critically flawed evaluation strategy?

Feature engineering and model evaluation (cross-validation, precision, recall) Hard
A. It reduces the amount of training data available in each fold, leading to a high-bias model.
B. Random shuffling is computationally inefficient for time-series data compared to a simple chronological split.
C. It violates the i.i.d. (independent and identically distributed) assumption, which is a necessary condition for all machine learning models.
D. It causes severe data leakage, as the model will be trained on data from the future to make 'predictions' about the past, leading to an unrealistically optimistic performance estimate.