Unit 3 - Practice Quiz

INT428 60 Questions

1 Which type of machine learning uses labeled data, where each data point is tagged with a correct output, to train a model?

Supervised, unsupervised, and reinforcement learning: concepts and real-world use Easy
A. Reinforcement Learning
B. Unsupervised Learning
C. Semi-supervised Learning
D. Supervised Learning

2 What measure of central tendency represents the most frequently occurring value in a dataset?

Statistics Easy
A. Mode
B. Mean
C. Median
D. Range

3 If you roll a single fair six-sided die, what is the probability of rolling an even number (2, 4, or 6)?

Probability Easy
A. 2/3
B. 1/6
C. 1/3
D. 1/2

4 In machine learning, a collection of numbers representing a single data point (e.g., height, weight, and age of a person) is typically stored as a:

Linear algebra (applied focus) Easy
A. Tensor
B. Matrix
C. Vector
D. Scalar

5 What is the primary purpose of cross-validation in model evaluation?

Feature engineering and model evaluation (cross-validation, precision, recall) Easy
A. To simplify the features of the data
B. To make the model train faster
C. To assess how a model will generalize to an independent, unseen dataset
D. To always increase the model's accuracy on the training data

6 Grouping customers into different segments based on their purchasing habits, without any predefined labels for the groups, is a classic example of:

Supervised, unsupervised, and reinforcement learning: concepts and real-world use Easy
A. Unsupervised Learning
B. Regression
C. Supervised Learning
D. Reinforcement Learning

7 Bayes' Theorem is used to update the probability of a hypothesis based on new evidence. What is the term for this updated probability?

Bayes theorem Easy
A. Prior probability
B. Marginal probability
C. Likelihood
D. Posterior probability

8 A 2-dimensional grid of numbers, often used to represent an entire dataset where rows are data points and columns are features, is called a:

Linear algebra (applied focus) Easy
A. Diagonal
B. Matrix
C. Scalar
D. Vector

9 In the context of a binary classification model, what does 'Precision' measure?

Feature engineering and model evaluation (cross-validation, precision, recall) Easy
A. The model's ability to correctly identify negative instances
B. The proportion of positive predictions that were actually correct
C. The overall accuracy of the model across all classes
D. The proportion of actual positive instances that were correctly identified

10 A self-driving car AI learning to navigate by receiving a 'reward' for a correct action and a 'penalty' for a mistake is using which type of machine learning?

Supervised, unsupervised, and reinforcement learning: concepts and real-world use Easy
A. Unsupervised Learning
B. Supervised Learning
C. Dimensionality Reduction
D. Reinforcement Learning

11 What is the 'median' of the following set of numbers: [1, 7, 3, 9, 5]?

Statistics Easy
A. 5
B. 25
C. 3
D. 7

12 What does a node in a Bayesian Network typically represent?

Bayesian networks, and probabilistic reasoning Easy
A. A random variable
B. A deterministic value
C. An entire dataset
D. A machine learning algorithm

13 The probability of any event is always a number between:

Probability Easy
A. -1 and 1 (inclusive)
B. 0 and 1 (inclusive)
C. 0 and 100 (inclusive)
D. 1 and infinity

14 The process of selecting, transforming, or creating the most suitable input variables for a machine learning model is called:

Feature engineering and model evaluation (cross-validation, precision, recall) Easy
A. Model evaluation
B. Feature engineering
C. Algorithm selection
D. Cross-validation

15 Predicting the exact price of a house based on its size, location, and number of bedrooms is an example of what kind of problem?

Supervised, unsupervised, and reinforcement learning: concepts and real-world use Easy
A. Association
B. Clustering
C. Classification
D. Regression

16 What is a scalar in the context of linear algebra?

Linear algebra (applied focus) Easy
A. An array of numbers
B. A single number
C. A type of model
D. A grid of numbers

17 Email spam detection, where an algorithm is trained on emails already labeled as 'spam' or 'not spam', is a task known as:

Supervised, unsupervised, and reinforcement learning: concepts and real-world use Easy
A. Classification
B. Regression
C. Clustering
D. Reinforcement Learning

18 In a dataset of student test scores, the difference between the highest and lowest score is known as the:

Statistics Easy
A. Mean
B. Variance
C. Standard Deviation
D. Range

19 In a Bayesian Network, what does a directed edge (an arrow) from Node A to Node B signify?

Bayesian networks, and probabilistic reasoning Easy
A. The state of Node A directly influences the probability of the state of Node B
B. Node A and Node B have the same probability distribution
C. The state of Node B directly influences the probability of the state of Node A
D. Node A and Node B are completely independent

20 A 'False Positive' in a medical test designed to detect a disease means:

Feature engineering and model evaluation (cross-validation, precision, recall) Easy
A. The test incorrectly indicates a sick person is healthy
B. The test correctly indicates a sick person has the disease
C. The test correctly indicates a healthy person is healthy
D. The test incorrectly indicates a healthy person has the disease

21 A model designed to detect a rare but critical disease has a high precision of 95% but a very low recall of 10%. What is the most accurate interpretation of this result?

Feature engineering and model evaluation (cross-validation, precision, recall) Medium
A. The model correctly identifies most of the patients who have the disease.
B. The model is highly reliable when it predicts a patient has the disease, but it misses most of the actual positive cases.
C. The model has a high overall accuracy and is performing well.
D. The model incorrectly flags many healthy patients as having the disease.
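
As a rough illustration of how precision and recall can diverge, the sketch below computes both from confusion-matrix counts. The counts (19 true positives, 1 false positive, 171 false negatives) are hypothetical, chosen only so that the numbers match the 95% / 10% figures in the question.

```python
def precision(tp, fp):
    """Share of positive predictions that were actually positive."""
    return tp / (tp + fp)


def recall(tp, fn):
    """Share of actual positives that the model found."""
    return tp / (tp + fn)


# Hypothetical counts, not taken from any real dataset.
tp, fp, fn = 19, 1, 171

p = precision(tp, fp)  # 19 / 20
r = recall(tp, fn)     # 19 / 190

print(f"precision={p:.2f}, recall={r:.2f}")
```

With these counts the model raises only 20 alarms, 19 of them correct, while 171 of the 190 sick patients go undetected.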

22 An e-commerce company wants to group its customers into distinct segments based on purchasing behavior (e.g., frequency, items bought, total spending) without any predefined labels for these segments. Which type of machine learning is most suitable for this task?

Supervised, unsupervised, and reinforcement learning: concepts and real-world use Medium
A. Unsupervised Learning
B. Supervised Learning
C. Reinforcement Learning
D. Semi-Supervised Learning

23 In Principal Component Analysis (PCA), the eigenvectors of the data's covariance matrix represent the...

Linear algebra (applied focus) Medium
A. variance of each principal component.
B. number of clusters in the data.
C. directions of maximum variance in the data.
D. average value of each feature.

24 A medical test for a disease has a 99% accuracy rate (it's correct 99% of the time). The disease has a prevalence of 1 in 10,000 people. If a randomly selected person tests positive, what can you conclude about the probability that they actually have the disease?

Bayes theorem, Bayesian networks, and probabilistic reasoning Medium
A. The probability is very high, but slightly less than 99%.
B. The probability is actually quite low (much less than 50%).
C. The probability is 99%.
D. It is impossible to determine without knowing the false negative rate.

25 A dataset of employee salaries at a tech company is heavily right-skewed due to a few extremely high executive salaries. Which measure of central tendency would provide the most realistic representation of a 'typical' employee's salary?

Statistics Medium
A. Mean
B. Median
C. Standard Deviation
D. Mode

26 A program learns to play chess by making moves and receiving a reward of +1 for a win, -1 for a loss, and 0 for a draw after each game. The program's goal is to maximize its cumulative reward over many games. This scenario is a prime example of:

Supervised, unsupervised, and reinforcement learning: concepts and real-world use Medium
A. Reinforcement Learning
B. Regression (Supervised Learning)
C. Clustering (Unsupervised Learning)
D. Classification (Supervised Learning)

27 To evaluate a model's performance and generalizability while tuning its hyperparameters, the standard practice is to split the data into three sets. What is the primary purpose of the 'validation' set?

Feature engineering and model evaluation (cross-validation, precision, recall) Medium
A. To select the best model hyperparameters without 'leaking' information from the test set.
B. To provide a final, unbiased evaluation of the model's performance on unseen data.
C. To increase the amount of data available for training.
D. To train the final model after hyperparameters have been chosen.

28 In natural language processing, words are often represented as high-dimensional vectors (word embeddings). The cosine similarity between two word vectors measures:

Linear algebra (applied focus) Medium
A. The difference in word frequency.
B. The number of characters the words have in common.
C. The semantic similarity or relatedness of the words.
D. The Euclidean distance between the words in the vector space.
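
For reference, cosine similarity can be sketched in a few lines of plain Python. The three-dimensional "embeddings" below are made-up toy values (real word embeddings have hundreds of dimensions), so only the relative ordering of the scores is meaningful.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Toy vectors invented for illustration only.
king = [0.8, 0.6, 0.1]
queen = [0.7, 0.7, 0.2]
banana = [-0.5, 0.1, 0.9]

sim_related = cosine_similarity(king, queen)
sim_unrelated = cosine_similarity(king, banana)
print(sim_related > sim_unrelated)
```

Because the dot product is divided by both norms, the score depends on the angle between the vectors and not on their lengths.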

29 In a classification problem, the output of a logistic regression model for a given input is 0.7. What does this value represent?

Probability Medium
A. The margin of separation from the decision boundary.
B. The accuracy of the model on this specific input.
C. The probability that the input belongs to the positive class.
D. The predicted class label is 0.7.

30 In a simple Bayesian Network representing a medical diagnosis, we have the structure: Disease -> Symptom. If we know a patient has the Disease, what does this tell us about the probability of them having the Symptom?

Bayes theorem, Bayesian networks, and probabilistic reasoning Medium
A. Knowing the patient has the Disease makes the Symptom certain to occur.
B. The probability of the Symptom becomes 0.
C. The probability of the Symptom is now conditioned on the presence of the Disease, and is given by $P(\text{Symptom} \mid \text{Disease})$.
D. Knowing the patient has the Disease does not change the probability of the Symptom.

31 What is the primary motivation for standardizing features (e.g., using Z-score normalization) before applying distance-based algorithms like K-Nearest Neighbors (KNN)?

Statistics Medium
A. To convert all features into a [0, 1] range.
B. To reduce the number of features in the dataset.
C. To prevent features with larger scales from dominating the distance calculations.
D. To make the data conform to a normal distribution.
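
A minimal sketch of Z-score standardization using Python's `statistics` module is shown below. The `ages` and `incomes` lists are hypothetical values on very different scales; after standardization both features end up with mean 0 and standard deviation 1, so neither dominates a distance computation.

```python
import statistics


def standardize(values):
    """Return Z-scores: (value - mean) / population standard deviation."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    return [(v - mean) / std for v in values]


# Hypothetical features on very different scales.
ages = [25, 35, 45, 55]
incomes = [30_000, 50_000, 70_000, 90_000]

z_ages = standardize(ages)
z_incomes = standardize(incomes)

print([round(z, 3) for z in z_ages])
print([round(z, 3) for z in z_incomes])
```

Note that Z-scores are not confined to the [0, 1] range and do not change the shape of the distribution; they only shift and rescale it.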

32 If you multiply a 2D vector by the matrix , what geometric transformation is applied to the vector?

Linear algebra (applied focus) Medium
A. A 90-degree counter-clockwise rotation.
B. A projection onto the x-axis.
C. A reflection across the y-axis.
D. A scaling by a factor of 2.

33 You perform 5-fold cross-validation to evaluate your machine learning model. If your dataset has 1000 instances, how many instances are in the training set for each of the 5 iterations?

Feature engineering and model evaluation (cross-validation, precision, recall) Medium
A. 200
B. 500
C. 1000
D. 800
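
The fold arithmetic can be checked with a small index-splitting sketch. The helper `k_fold_indices` is a hypothetical name written for this example; it uses contiguous, unshuffled folds, which is a simplification of what library implementations typically offer.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, test_indices) for k contiguous folds."""
    fold_size = n_samples // k
    for fold in range(k):
        start = fold * fold_size
        # The last fold absorbs any remainder when n_samples % k != 0.
        stop = start + fold_size if fold < k - 1 else n_samples
        test = list(range(start, stop))
        train = list(range(0, start)) + list(range(stop, n_samples))
        yield train, test


sizes = [(len(train), len(test)) for train, test in k_fold_indices(1000, 5)]
print(sizes)
```

Each iteration holds out one fold of 1000 / 5 = 200 instances and trains on the remaining four folds.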

34 Which of the following problems is best framed as a regression task rather than a classification task?

Supervised, unsupervised, and reinforcement learning: concepts and real-world use Medium
A. Predicting the price of a house based on its features.
B. Determining if an email is spam or not spam.
C. Identifying the species of a flower from a photo.
D. Predicting whether a customer will churn (yes/no).

35 Two events A and B are mutually exclusive. If P(A) = 0.4 and P(B) = 0.3, what is the probability of A or B occurring, i.e., $P(A \cup B)$?

Probability Medium
A. 0.1
B. 0.12
C. 1.0
D. 0.7

36 You are building a spam filter. Given that P(Spam) = 0.2, P(contains 'free' | Spam) = 0.8, and P(contains 'free' | Not Spam) = 0.05. Using Bayes' theorem, what are you trying to calculate?

Bayes theorem, Bayesian networks, and probabilistic reasoning Medium
A. P(Not Spam)
B. P(Spam | contains 'free')
C. P(contains 'free')
D. P(Spam, contains 'free')

37 What is the primary risk of performing feature selection based on the model's performance on the final test set?

Feature engineering and model evaluation (cross-validation, precision, recall) Medium
A. It can lead to a model that is too simple (underfitting).
B. It requires the data to be normally distributed.
C. It causes information from the test set to leak into the model selection process, leading to an over-optimistic performance estimate.
D. It is computationally too expensive.

38 What does it imply if the determinant of a transformation matrix used in a machine learning model is zero?

Linear algebra (applied focus) Medium
A. The transformation is a pure rotation.
B. The transformation collapses the data into a lower-dimensional space.
C. The transformation is an identity operation (no change).
D. The transformation scales the data uniformly.

39 In a linear regression analysis, you create a plot of residuals versus fitted values and observe a distinct funnel shape (heteroscedasticity). What key assumption of linear regression does this violate?

Statistics Medium
A. Constant variance of errors (homoscedasticity).
B. Normality of errors.
C. Linearity of the relationship.
D. Independence of errors.

40 A hospital has a large dataset of patient records, where a small fraction of the records are labeled with a correct diagnosis by expert doctors, but the vast majority are unlabeled. The goal is to build a diagnostic model using all the available data. This problem is an example of:

Supervised, unsupervised, and reinforcement learning: concepts and real-world use Medium
A. Active Learning
B. Multi-task Learning
C. Transfer Learning
D. Semi-Supervised Learning

41 In the context of Principal Component Analysis (PCA), if the covariance matrix of your centered data is a non-identity diagonal matrix (i.e., diagonal entries are positive but not all equal to 1), what can you definitively conclude about the principal components?

Linear algebra (applied focus) Hard
A. The covariance matrix is singular, and PCA cannot be computed.
B. The principal components are aligned with the original feature axes, and the transformation is essentially a scaling, not a rotation.
C. The principal components will be at a 45-degree angle to the original axes, reflecting an average of the variances.
D. The data is perfectly correlated, and PCA will reduce its dimensionality to one.

42 You are developing a credit fraud detection model with a dataset where only 0.1% of transactions are fraudulent. After training, you achieve 99.9% accuracy. You then evaluate using the Area Under the Precision-Recall Curve (AUC-PR) and get a score of 0.2. What is the most accurate and nuanced interpretation of these results?

Feature engineering and model evaluation (cross-validation, precision, recall) Hard
A. The model is severely overfitting, as indicated by the large discrepancy between the accuracy score and the AUC-PR score.
B. The model is excellent because the accuracy is nearly perfect, and the low AUC-PR score must be an error in calculation or interpretation.
C. An AUC-PR of 0.2 is very poor for any dataset, indicating the model's predictions are no better than random guessing.
D. The high accuracy is a misleading metric due to extreme class imbalance, and the AUC-PR of 0.2, while appearing low, is significantly better than a random baseline and indicates the model has some, albeit imperfect, skill.

43 In the Bayesian network defined by the structure , which statement about the probabilistic relationship between nodes A and C is correct?

Bayesian networks, and probabilistic reasoning Hard
A. A and C are marginally dependent but become conditionally independent given B.
B. A and C are marginally independent but become conditionally dependent given B.
C. A and C are both marginally and conditionally independent of each other.
D. A and C are both marginally and conditionally dependent on each other.

44 A robotics company wants to train a bipedal robot to walk on varied and unseen terrain. The robot receives sensor data about its joint angles and orientation and a sparse positive reward only when it reaches a destination. It receives a large negative reward if it falls. Which specific class of algorithms within a broader ML paradigm is most appropriate for this task?

Supervised, unsupervised, and reinforcement learning: concepts and real-world use Hard
A. Model-based Reinforcement Learning to create a perfect simulation of the robot's dynamics.
B. Supervised Learning with regression to predict optimal joint angles.
C. Unsupervised Learning (e.g., autoencoders) to learn a low-dimensional representation of the terrain.
D. Model-free, policy-based Reinforcement Learning (e.g., PPO, A3C)

45 When solving a linear regression problem using the normal equation, $\theta = (X^T X)^{-1} X^T y$, you find that the matrix $X^T X$ is singular (non-invertible). What is the most likely data-related cause and the mathematical consequence?

Linear algebra (applied focus) Hard
A. Cause: The target variable y contains extreme outliers. Consequence: The matrix inverse cannot be computed.
B. Cause: The number of samples is far greater than the number of features. Consequence: The model will underfit the data.
C. Cause: Perfect multicollinearity among features. Consequence: The system has infinite solutions for the regression coefficients $\theta$.
D. Cause: Features are not normalized. Consequence: The inverse operation is numerically unstable, but a unique solution technically exists.

46 In a K-fold cross-validation setup, what is the primary statistical trade-off when increasing the value of K from 5 to N (where N is the total number of samples), a technique also known as Leave-One-Out Cross-Validation (LOOCV)?

Feature engineering and model evaluation (cross-validation, precision, recall) Hard
A. The computational cost decreases, but the bias of the estimate increases.
B. Both the bias and variance of the performance estimate decrease, improving reliability.
C. The variance of the performance estimate decreases, but its bias increases.
D. The bias of the performance estimate decreases, but its variance increases significantly.

47 A rare disease affects 1 in 10,000 people. A test for the disease is developed with a 99% true positive rate and a 98% true negative rate. If a person tests positive, what is the probability they actually have the disease? The key challenge here is the combination of a rare event and an imperfect test.

Bayes theorem Hard
A. Approximately 0.49%
B. Approximately 2%
C. Approximately 50%
D. Approximately 99%
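
The calculation behind this question can be sketched with Bayes' theorem directly. The function name `posterior` and its parameter names are choices made for this example; the three input numbers come from the question itself.

```python
def posterior(prior, sensitivity, specificity):
    """P(disease | positive test) via Bayes' theorem."""
    true_pos = sensitivity * prior                  # P(positive and disease)
    false_pos = (1 - specificity) * (1 - prior)     # P(positive and no disease)
    return true_pos / (true_pos + false_pos)


p = posterior(prior=1 / 10_000, sensitivity=0.99, specificity=0.98)
print(f"{p:.4%}")
```

The false positives from the very large healthy population (2% of 9,999 in every 10,000) far outnumber the true positives (99% of 1 in every 10,000), which is why the posterior stays well below 1%.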

48 What is the primary motivation for using an off-policy reinforcement learning algorithm like Q-Learning over an on-policy one like SARSA?

Supervised, unsupervised, and reinforcement learning: concepts and real-world use Hard
A. It avoids the need for a discount factor gamma, simplifying the Bellman equation.
B. It allows the agent to learn about the optimal policy while behaving according to an exploratory (sub-optimal) policy.
C. It is a model-based approach, which is more sample-efficient than the model-free on-policy methods.
D. It guarantees faster convergence by reducing the variance of the value function updates.

49 A dataset has two binary features, and , and a binary class . You observe that and . If you also know that and are conditionally independent given , can be lower than 0.6? Why or why not?

Probability Hard
A. Yes, but only if the features and are negatively correlated.
B. No, because the two features independently provide evidence for Y=1, their combined evidence must be stronger than either alone.
C. Yes, this can happen if and are both very high, causing the evidence for to accumulate faster.
D. No, the conditional independence assumption implies that , which would be .

50 You are comparing two complex models (e.g., a deep neural network and a gradient boosting machine) using 5x2 cross-validation and a paired t-test on the results to claim statistical significance. According to Dietterich (1998), why might this statistical test have a higher than desired Type I error rate in this specific machine learning context?

Statistics Hard
A. The 5x2 CV procedure systematically biases the performance estimate in favor of the more complex model.
B. A paired t-test cannot be used to compare two different algorithms; it can only compare one algorithm with different hyperparameters.
C. The training sets in cross-validation are highly overlapping, which violates the independence assumption of the t-test, leading to an underestimation of the true variance.
D. The underlying distribution of model accuracies is often not Gaussian, which is a core assumption of the t-test.

51 You apply both K-Means and a Gaussian Mixture Model (GMM) with 3 components to a dataset. You find that K-Means identifies three well-separated, spherical clusters, while the GMM identifies three overlapping, elliptical clusters. Which statement is the most valid conclusion?

Supervised, unsupervised, and reinforcement learning: concepts and real-world use Hard
A. Both algorithms have failed; K-Means is too simplistic and GMM is too complex, so a density-based algorithm like DBSCAN should be used.
B. The results are equivalent, as GMM is just a probabilistic version of K-Means and will always converge to a similar result.
C. GMM provides a more nuanced result by modeling cluster covariance and providing probabilistic assignments, which is likely superior if the true clusters are not perfectly spherical and separated.
D. K-Means is the correct choice because it produced well-separated clusters, indicating the GMM is overfitting the data by creating complex shapes.

52 You are building a model to predict house prices and have a 'zip_code' categorical feature with over 1000 unique values. Why is one-hot encoding this feature for a linear regression model often a poor choice, and what is a more effective (though potentially risky) alternative?

Feature engineering and model evaluation (cross-validation, precision, recall) Hard
A. Poor choice: It creates a very high-dimensional, sparse feature space (curse of dimensionality) which can hurt model performance and interpretability. Alternative: Target encoding, where each zip code is replaced by the average house price within that zip code.
B. Poor choice: It violates the independence assumption of linear regression. Alternative: Hashing the feature into a smaller number of dimensions.
C. Poor choice: It introduces perfect multicollinearity into the feature matrix. Alternative: Using ordinal encoding by sorting zip codes numerically.
D. Poor choice: It cannot be used in a linear model, only in tree-based models. Alternative: Deleting the feature from the dataset entirely.

53 You are modeling a system with a Naive Bayes classifier. The 'naive' assumption is that all features are conditionally independent given the class, i.e., $P(X_1, \dots, X_n \mid Y) = \prod_{i=1}^{n} P(X_i \mid Y)$. If two features, $X_1$ and $X_2$, are actually perfectly correlated, how does this violation of the assumption affect the posterior probability calculation for class Y?

Bayesian networks, and probabilistic reasoning Hard
A. It will 'double-count' the evidence from the correlated features, leading to posterior probabilities that are unjustifiably extreme (pushed towards 0 or 1).
B. It will cause the model to ignore one of the features, as the information is redundant, leading to a loss of signal.
C. The model's posterior probability calculation will fail due to a division by zero error caused by the linear dependency.
D. The violation has no effect on the rank-ordering of the posterior probabilities, so the final classification decision remains optimal.

54 In a recommender system using truncated Singular Value Decomposition (SVD) on the user-item matrix , what is the geometric interpretation of predicting a user's rating for an unseen item?

Linear algebra (applied focus) Hard
A. It is calculated by projecting the user's vector onto the principal components of the item-item covariance matrix.
B. It is equivalent to taking the dot product of the user's vector in the k-dimensional latent space (a row in ) and the item's vector in the same space (a row in ).
C. It is the cosine similarity between the user's latent factor vector (from ) and the item's latent factor vector (from ).
D. It involves finding the Euclidean distance between the user's vector and the item's vector in the original high-dimensional space.

55 You have trained a Support Vector Machine (SVM) with a Radial Basis Function (RBF) kernel. You observe that the model has very high accuracy on the training set but poor accuracy on the test set. Which hyperparameter adjustments are most likely to mitigate this overfitting?

Supervised, unsupervised, and reinforcement learning: concepts and real-world use Hard
A. Decrease the regularization parameter C and decrease the kernel coefficient gamma.
B. Increase the regularization parameter C and decrease the kernel coefficient gamma.
C. Decrease the regularization parameter C and increase the kernel coefficient gamma.
D. Increase the regularization parameter C and increase the kernel coefficient gamma.

56 In a binary classification problem with severe class imbalance, where the positive class is rare but of high importance, why is the Area Under the ROC Curve (AUC-ROC) potentially a misleading metric compared to the Area Under the Precision-Recall Curve (AUC-PR)?

Feature engineering and model evaluation (cross-validation, precision, recall) Hard
A. AUC-ROC includes True Negatives in its calculation (via the False Positive Rate), and in an imbalanced setting, a model can achieve a high score by simply correctly identifying the overwhelmingly large number of true negatives.
B. AUC-ROC is only applicable to linear models, while AUC-PR can be used for any classifier.
C. AUC-PR is insensitive to the decision threshold, whereas AUC-ROC is highly sensitive, making AUC-PR more robust.
D. AUC-ROC assumes that the costs of false positives and false negatives are equal.

57 In a Bayesian network, if a directed path from node X to node Z exists, such as $X \to Y \to Z$, what is the effect of conditioning on node Y?

Bayesian networks, and probabilistic reasoning Hard
A. It has no effect on the relationship between X and Z.
B. It reverses the direction of influence from Z to X.
C. It makes X and Z conditionally dependent.
D. It makes X and Z conditionally independent.

58 You are building a linear regression model and suspect multicollinearity. You observe that the Variance Inflation Factors (VIFs) for several predictors are very high (>10). How does Ridge Regression (L2 regularization) specifically address the mathematical instability caused by this issue?

Statistics Hard
A. By using a robust loss function that is less sensitive to the large coefficients that result from multicollinearity.
B. By performing feature selection and setting the coefficients of correlated predictors to exactly zero.
C. By transforming the correlated features into a new set of orthogonal features using PCA before fitting the model.
D. By adding a positive value ($\lambda$) to the diagonal of the $X^T X$ matrix before inversion, making the matrix non-singular and stable even with correlated features.
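
To make the effect of L2 regularization on a singular Gram matrix concrete, the sketch below builds X^T X for a tiny two-feature dataset in which the second feature is exactly twice the first. The data, the helper names, and the penalty value of 1.0 are all assumptions made for illustration.

```python
def gram_matrix(rows):
    """Return X^T X as a 2x2 nested list for rows of (x1, x2)."""
    a = sum(x1 * x1 for x1, _ in rows)
    b = sum(x1 * x2 for x1, x2 in rows)
    d = sum(x2 * x2 for _, x2 in rows)
    return [[a, b], [b, d]]


def det2(m):
    """Determinant of a 2x2 matrix."""
    return m[0][0] * m[1][1] - m[0][1] * m[1][0]


def add_ridge(m, lam):
    """Add lam to each diagonal entry, i.e. form X^T X + lam * I."""
    return [[m[0][0] + lam, m[0][1]], [m[1][0], m[1][1] + lam]]


# Hypothetical data: the second feature is exactly 2 * the first.
X = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]

xtx = gram_matrix(X)
print(det2(xtx))                  # zero: X^T X cannot be inverted
print(det2(add_ridge(xtx, 1.0)))  # positive: the regularized matrix can
```

A zero determinant means the plain normal equation has no unique solution; shifting the diagonal by a positive amount makes the determinant strictly positive, so the regularized system does.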

59 In the context of deep learning, what is the critical difference between semi-supervised learning and transfer learning?

Supervised, unsupervised, and reinforcement learning: concepts and real-world use Hard
A. Semi-supervised learning requires at least two different models to be trained, whereas transfer learning only requires fine-tuning one model.
B. Transfer learning is used for reinforcement learning problems, while semi-supervised learning is used for classification problems.
C. Transfer learning is a form of unsupervised learning, while semi-supervised learning is a form of supervised learning.
D. Semi-supervised learning uses a large amount of unlabeled data and a small amount of labeled data from the same task to improve performance, while transfer learning uses knowledge (e.g., weights) from a different, pre-trained task to bootstrap learning on a new task.

60 You are building a time-series forecasting model to predict next month's sales based on the previous 12 months. You decide to use 5-fold cross-validation by randomly shuffling and splitting your 5 years of monthly data. Why is this a critically flawed evaluation strategy?

Feature engineering and model evaluation (cross-validation, precision, recall) Hard
A. It causes severe data leakage, as the model will be trained on data from the future to make 'predictions' about the past, leading to an unrealistically optimistic performance estimate.
B. It violates the i.i.d. (independent and identically distributed) assumption, which is a necessary condition for all machine learning models.
C. It reduces the amount of training data available in each fold, leading to a high-bias model.
D. Random shuffling is computationally inefficient for time-series data compared to a simple chronological split.
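
One common alternative to shuffled K-fold for temporal data is an expanding-window split, in which every training set ends before its test window begins. The sketch below is a hand-rolled illustration, similar in spirit to scikit-learn's `TimeSeriesSplit`; the function name and the choice of a 6-month test window are assumptions for this example.

```python
def expanding_window_splits(n_samples, n_splits, test_size):
    """Yield (train_indices, test_indices) with training data strictly earlier."""
    for i in range(n_splits):
        test_end = n_samples - (n_splits - 1 - i) * test_size
        test_start = test_end - test_size
        train = list(range(0, test_start))
        test = list(range(test_start, test_end))
        yield train, test


# 5 years of monthly data = 60 observations, indexed in chronological order.
splits = list(expanding_window_splits(n_samples=60, n_splits=5, test_size=6))

for train, test in splits:
    print(f"train: 0..{train[-1]}  test: {test[0]}..{test[-1]}")
```

Each successive split trains on a longer history and evaluates on the months immediately after it, so no future observation is ever used to predict the past.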