1. What is the primary characteristic of the data used in unsupervised learning?
Supervised vs. Unsupervised vs. Semi-Supervised Learning
Easy
A.It requires an external reward signal to guide learning.
B.It consists of completely unlabeled data.
C.It contains only sequential data with timestamps.
D.It contains both features and explicitly labeled target variables.
Correct Answer: It consists of completely unlabeled data.
Explanation:
Unsupervised learning algorithms operate on datasets without predefined labels or target variables, aiming to find hidden structures from the input features alone.
2. Which learning paradigm uses a dataset containing a small amount of labeled data and a large amount of unlabeled data?
Supervised vs. Unsupervised vs. Semi-Supervised Learning
Easy
A.Semi-Supervised Learning
B.Unsupervised Learning
C.Supervised Learning
D.Reinforcement Learning
Correct Answer: Semi-Supervised Learning
Explanation:
Semi-supervised learning falls between supervised and unsupervised learning. It leverages a small portion of labeled data alongside a vast amount of unlabeled data to improve learning accuracy.
3. If a machine learning model is trained to predict house prices based on historical sales data where the final price is known, which type of learning is this?
Supervised vs. Unsupervised vs. Semi-Supervised Learning
Easy
A.Semi-Supervised Learning
B.Self-Supervised Learning
C.Supervised Learning
D.Unsupervised Learning
Correct Answer: Supervised Learning
Explanation:
Since the historical sales data includes the known final price (the label), predicting future prices based on this data is a classic supervised learning task.
4. In the mathematical formulation of an unsupervised learning problem, the dataset is typically represented as:
Problem formulation for unsupervised learning
Easy
A. where is a reward
B.
C.
D.
Correct Answer:
Explanation:
In unsupervised learning, there are no target labels $y$. The dataset consists entirely of input features $x$.
5. What is the primary objective when formulating an unsupervised learning problem?
Problem formulation for unsupervised learning
Easy
A.To manually assign categories to all incoming data streams.
B.To discover underlying patterns, groupings, or structures within the data.
C.To map input features to specific, predefined output labels accurately.
D.To maximize a numerical reward signal over time.
Correct Answer: To discover underlying patterns, groupings, or structures within the data.
Explanation:
Without labels to guide the algorithm, the main goal of unsupervised learning is to automatically find hidden patterns, such as clusters or lower-dimensional representations, in the data.
6. How is unsupervised learning applied in market segmentation?
Real-life use cases: Market segmentation
Easy
A.By mapping product images to specific text descriptions.
B.By predicting the exact dollar amount a customer will spend next month.
C.By automatically grouping customers with similar purchasing habits into distinct segments.
D.By classifying whether a transaction is approved or declined by a bank.
Correct Answer: By automatically grouping customers with similar purchasing habits into distinct segments.
Explanation:
Market segmentation uses clustering (an unsupervised technique) to find natural groupings of customers based on their behavior, without knowing the segments in advance.
7. Why is market segmentation considered an unsupervised learning problem?
Real-life use cases: Market segmentation
Easy
A.Because the company already knows exactly which segment each customer belongs to.
B.Because the algorithm discovers natural customer groupings without predefined category labels.
C.Because the data relies heavily on external rewards from ad clicks.
D.Because it requires a large team of humans to manually label the customer data.
Correct Answer: Because the algorithm discovers natural customer groupings without predefined category labels.
Explanation:
The true segments are unknown beforehand. The algorithm explores the data to find inherent groupings, making it an unsupervised task.
8. In customer behavior analysis, what does 'association rule learning' (an unsupervised method) typically discover?
Real-life use cases: Customer behavior analysis
Easy
A.The precise date a customer will cancel their subscription.
B.The exact age and gender of an anonymous website visitor.
C.Rules showing which products are frequently bought together (e.g., 'If bread, then butter').
D.The mathematical equation for the company's total annual revenue.
Correct Answer: Rules showing which products are frequently bought together (e.g., 'If bread, then butter').
Explanation:
Association rule learning finds interesting relations between variables in large databases, such as identifying items that frequently co-occur in shopping baskets.
9. When an e-commerce platform groups website visitors based on their browsing paths to better understand user journeys, this is an example of:
Real-life use cases: Customer behavior analysis
Easy
A.Reinforcement Learning
B.Customer behavior analysis using unsupervised learning
C.Supervised Image Classification
D.Predictive regression modeling
Correct Answer: Customer behavior analysis using unsupervised learning
Explanation:
Grouping browsing paths without predefined categories is a way to discover organic user behaviors, which relies on unsupervised learning.
10. What is the main goal of anomaly detection in data mining?
Real-life use cases: Anomaly & fraud detection
Easy
A.To find the most common and frequent items in a dataset.
B.To identify rare data points or events that deviate significantly from normal behavior.
C.To classify images into predefined categories.
D.To calculate the average value of all numerical features.
Correct Answer: To identify rare data points or events that deviate significantly from normal behavior.
Explanation:
Anomaly detection looks for the 'odd ones out': data points that do not conform to the expected pattern or standard behavior.
11. Why is unsupervised learning highly suitable for detecting new, unseen types of credit card fraud?
Real-life use cases: Anomaly & fraud detection
Easy
A.Because it trains the fraudster on how to avoid detection.
B.Because it detects deviations from normal spending patterns without needing prior examples of the new fraud.
C.Because it relies on explicitly labeled examples of past fraud.
Correct Answer: Because it detects deviations from normal spending patterns without needing prior examples of the new fraud.
Explanation:
Unsupervised anomaly detection establishes what 'normal' behavior looks like and flags anything unusual, allowing it to catch entirely new fraud techniques that haven't been labeled yet.
12. In bioinformatics, clustering algorithms are often used to group genes. What is the algorithm attempting to discover?
Real-life use cases: Pattern discovery in biological and social data
Easy
A.The English translation of the genetic code.
B.Genes that have similar expression patterns under certain conditions.
C.The specific name of the disease caused by a single gene.
D.The exact age of the organism from which the gene was extracted.
Correct Answer: Genes that have similar expression patterns under certain conditions.
Explanation:
Gene clustering is an unsupervised technique used to group genes that behave similarly, which can imply they share related functions or regulatory mechanisms.
13. In a social network, discovering tightly knit groups of friends or users who interact frequently is known as:
Real-life use cases: Pattern discovery in biological and social data
Easy
A.Community detection
B.Regression analysis
C.Time-series forecasting
D.Supervised binary classification
Correct Answer: Community detection
Explanation:
Community detection is an unsupervised pattern discovery method used in graph/social data to find clusters of highly connected nodes (communities).
14. Which distance metric represents the shortest straight-line distance between two points in Euclidean space?
When two vectors point in the exact same direction, the angle between them is $0^\circ$. Since $\cos(0^\circ) = 1$, the cosine similarity is $1$.
19. Why is the choice of distance metric crucial in an unsupervised learning algorithm like K-Means?
when and why distance choice matters
Easy
A.Because it determines the learning rate of the neural network.
B.Because it provides the mathematical definition of 'similarity', which determines how data points are grouped.
C.Because unsupervised algorithms cannot run unless distance is measured in miles.
D.Because it acts as the labeled target variable for the model.
Correct Answer: Because it provides the mathematical definition of 'similarity', which determines how data points are grouped.
Explanation:
Clustering algorithms group data based on how 'close' or 'similar' points are. Different distance metrics (e.g., Euclidean vs. Manhattan) define 'closeness' differently, which can lead to different cluster assignments and cluster shapes.
20. If you are clustering text documents based on word frequencies, why is Cosine similarity usually preferred over Euclidean distance?
when and why distance choice matters
Easy
A.Because text documents are always measured in a 2D space.
B.Because Euclidean distance cannot be computed on numerical data.
D.Because Cosine similarity focuses on the orientation (content similarity) rather than the magnitude (document length).
Correct Answer: Because Cosine similarity focuses on the orientation (content similarity) rather than the magnitude (document length).
Explanation:
In text analysis, a long document and a short document on the same topic will have vastly different vector magnitudes but point in the same direction. Cosine similarity correctly identifies them as similar by ignoring length.
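As a rough illustration of this point, the sketch below compares the two measures on made-up word-count vectors. The vocabulary, the counts, and the helper names `cosine_similarity` and `euclidean_distance` are all assumptions chosen for the example, not part of the quiz.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def euclidean_distance(u, v):
    """Straight-line distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

# Hypothetical word counts over the vocabulary ["goal", "match", "vote"].
short_doc = [2, 1, 0]     # short sports article
long_doc = [20, 10, 0]    # same topic mix, ten times longer
other_doc = [0, 1, 2]     # different topic, similar length to short_doc

cos_same_topic = cosine_similarity(short_doc, long_doc)
cos_other_topic = cosine_similarity(short_doc, other_doc)
dist_same_topic = euclidean_distance(short_doc, long_doc)
dist_other_topic = euclidean_distance(short_doc, other_doc)

print(cos_same_topic, cos_other_topic)    # same-topic pair scores higher
print(dist_same_topic, dist_other_topic)  # Euclidean ranks the unrelated doc as closer
```

In this toy data the two sports documents have a cosine similarity of essentially 1 despite their length difference, while Euclidean distance places the short sports document closer to the unrelated one.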
21. A medical research facility has a dataset of 50,000 X-ray images, but only 1,000 of them have been diagnosed and tagged by expert radiologists. The facility wants to build a model to categorize the remaining images. Which learning paradigm is most appropriate for this scenario?
Supervised vs. Unsupervised vs. Semi-Supervised Learning
Medium
A.Reinforcement Learning, by rewarding the model for correctly guessing the labels of the 49,000 images.
B.Supervised Learning, by training only on the 1,000 labeled images and ignoring the rest.
C.Semi-Supervised Learning, by using the 1,000 labeled images to guide the learning process over the entire dataset.
D.Unsupervised Learning, by clustering all 50,000 images and ignoring the labels entirely.
Correct Answer: Semi-Supervised Learning, by using the 1,000 labeled images to guide the learning process over the entire dataset.
Explanation:
Semi-supervised learning is ideal when you have a small amount of labeled data and a large amount of unlabeled data. It leverages the structural information of the unlabeled data alongside the explicit targets of the labeled data to improve model accuracy.
22. Which of the following scenarios describes a transition from a supervised learning problem to an unsupervised learning problem?
Supervised vs. Unsupervised vs. Semi-Supervised Learning
Medium
A.Switching from using a small set of labeled data and a large set of unlabeled data to using exclusively labeled data.
B.Switching from clustering news articles by similarity to using a pre-trained model to classify them into 'Sports', 'Politics', and 'Tech'.
C.Switching from predicting customer churn based on past data to predicting future sales revenue.
D.Switching from classifying emails as spam or not spam to identifying natural topical groupings in a large corpus of text without using predefined categories.
Correct Answer: Switching from classifying emails as spam or not spam to identifying natural topical groupings in a large corpus of text without using predefined categories.
Explanation:
Supervised learning relies on labeled data (like spam/not spam). Removing the predefined labels and simply looking for inherent structures or groupings (topics) represents an unsupervised approach.
23. In contrasting learning paradigms, which fundamental characteristic distinguishes how a model's performance is typically evaluated in supervised versus unsupervised learning?
Supervised vs. Unsupervised vs. Semi-Supervised Learning
Medium
A.Supervised learning evaluation relies on comparing predictions to known ground-truth labels, whereas unsupervised learning often relies on internal metrics like intra-cluster variance.
B.Supervised learning uses distance metrics like Euclidean distance, whereas unsupervised learning strictly uses loss functions like Cross-Entropy.
C.There is no difference; both paradigms require a hold-out test set with ground-truth labels to evaluate generalization.
D.Supervised learning evaluates the speed of convergence, whereas unsupervised learning evaluates the size of the dataset processed.
Correct Answer: Supervised learning evaluation relies on comparing predictions to known ground-truth labels, whereas unsupervised learning often relies on internal metrics like intra-cluster variance.
Explanation:
Because unsupervised learning lacks ground-truth labels, its performance cannot be evaluated using metrics like accuracy or error rate. Instead, it relies on heuristic internal metrics such as cluster cohesion or separation (e.g., silhouette score).
24. Let $X \in \mathbb{R}^{n \times d}$ represent a dataset. Which of the following best represents a common mathematical objective in an unsupervised dimensionality reduction task?
Problem formulation for unsupervised learning
Medium
A.Find a transformation to project $X$ into $\mathbb{R}^{k}$ (where $k < d$) such that the variance of the projected data is maximized or reconstruction error is minimized.
B.Maximize the reward function over a sequence of actions taken within the data space.
C.Assign each data point to a class by maximizing the margin between the classes.
D.Find a mapping function $f: X \to y$, where $f$ minimizes the Mean Squared Error against a target variable.
Correct Answer: Find a transformation to project $X$ into $\mathbb{R}^{k}$ (where $k < d$) such that the variance of the projected data is maximized or reconstruction error is minimized.
Explanation:
Dimensionality reduction (e.g., PCA) is an unsupervised task formulated to reduce the number of features ($d$) while retaining as much information (variance) as possible, without reference to any target labels.
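To make the "maximize the variance of the projected data" objective concrete, here is a minimal sketch for 2D points projected onto a single direction. It is not a PCA implementation: it simply tries every whole-degree angle and keeps the one with the largest projected variance. The data points and the function name `projected_variance` are assumptions for illustration.

```python
import math

def projected_variance(points, angle_deg):
    """Variance of 2D points after projecting them onto a unit direction."""
    theta = math.radians(angle_deg)
    direction = (math.cos(theta), math.sin(theta))
    mean_x = sum(p[0] for p in points) / len(points)
    mean_y = sum(p[1] for p in points) / len(points)
    # Center each point, then take its coordinate along the chosen direction.
    projections = [
        (p[0] - mean_x) * direction[0] + (p[1] - mean_y) * direction[1]
        for p in points
    ]
    return sum(t * t for t in projections) / len(points)

# Hypothetical data lying roughly along the line y = x.
points = [(1.0, 1.0), (2.0, 2.1), (3.0, 2.9), (4.0, 4.2), (5.0, 4.8)]

# Brute-force search over directions 0..179 degrees (no labels involved).
best_angle = max(range(180), key=lambda a: projected_variance(points, a))

print(best_angle)
print(projected_variance(points, best_angle), projected_variance(points, 0))
```

The search settles on a direction close to the diagonal, which keeps more variance than projecting onto the x-axis alone. PCA finds the same kind of direction analytically rather than by search.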
25. In the formulation of a clustering problem, the objective is often to partition a set of $n$ observations into $k$ sets $S = \{S_1, S_2, \dots, S_k\}$. What is the typical goal of the objective function in this context?
Problem formulation for unsupervised learning
Medium
A.To assign equal numbers of observations to each set regardless of the distance between points.
B.To maximize the distance between data points within the same set $S_i$.
C.To minimize the distance between the centroids of different sets $S_i$ and $S_j$.
D.To minimize the within-cluster sum of squares (intra-cluster variance) and maximize the inter-cluster variance.
Correct Answer: To minimize the within-cluster sum of squares (intra-cluster variance) and maximize the inter-cluster variance.
Explanation:
A fundamental objective of clustering is to ensure that points within the same cluster are as similar as possible (minimizing intra-cluster variance) and that different clusters are as distinct as possible (maximizing inter-cluster variance).
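The within-cluster sum of squares can be computed directly for any candidate partition. The sketch below, with made-up 2D points and the hypothetical helper names `centroid` and `wcss`, compares a sensible partition against a poor one.

```python
def centroid(points):
    """Coordinate-wise mean of a list of equal-length tuples."""
    n = len(points)
    dims = len(points[0])
    return tuple(sum(p[i] for p in points) / n for i in range(dims))

def wcss(clusters):
    """Within-cluster sum of squared distances to each cluster's centroid."""
    total = 0.0
    for points in clusters:
        c = centroid(points)
        for p in points:
            total += sum((x - m) ** 2 for x, m in zip(p, c))
    return total

# Hypothetical data: two visibly separate groups of three points each.
good_partition = [
    [(0, 0), (0, 1), (1, 0)],
    [(10, 10), (10, 11), (11, 10)],
]
# Same six points, but with one point from each group swapped.
bad_partition = [
    [(0, 0), (0, 1), (10, 10)],
    [(1, 0), (10, 11), (11, 10)],
]

print(wcss(good_partition))  # small: points sit near their centroids
print(wcss(bad_partition))   # much larger: clusters mix distant points
```

A clustering objective of this kind prefers the first partition because its within-cluster sum of squares is far smaller.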
26. A retail brand applies a clustering algorithm to demographic and purchasing data to achieve market segmentation. What is the primary operational benefit of the output generated by this unsupervised learning task?
Real-life use cases: Market segmentation
Medium
A.It perfectly predicts the exact monetary value of the next purchase for every individual customer.
B.It automatically labels which customers will churn in the next 30 days based on historical ground truth.
C.It eliminates the need for capturing future customer demographic data.
D.It identifies distinct customer profiles based on shared behaviors, allowing for highly targeted and personalized marketing campaigns.
Correct Answer: It identifies distinct customer profiles based on shared behaviors, allowing for highly targeted and personalized marketing campaigns.
Explanation:
Market segmentation groups customers with similar traits. While it doesn't predict exact future values or churn (which are supervised tasks), it allows marketers to tailor campaigns to specific behavioral profiles discovered by the clustering algorithm.
27. An e-commerce platform uses association rule mining, an unsupervised learning technique, to analyze customer transaction logs. Which of the following insights is most likely derived from this analysis?
Customer behavior analysis
Medium
A.The identification of fraudulent credit card transactions during checkout.
B.The discovery that customers who purchase laptops are 70% more likely to also purchase a wireless mouse in the same transaction.
C.The exact probability that a specific customer will return an item next month.
D.The classification of user reviews into positive, neutral, or negative sentiments.
Correct Answer: The discovery that customers who purchase laptops are 70% more likely to also purchase a wireless mouse in the same transaction.
Explanation:
Association rule mining (like the Apriori algorithm) discovers co-occurrence patterns in transactional data, often referred to as 'Market Basket Analysis'. Identifying frequently bought together items is a classic example of this.
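The quantities behind a rule such as "laptop, then mouse" are simple counts. The following sketch computes support, confidence, and lift for one rule on a small made-up transaction log; it does not implement Apriori itself, and the transactions and the `support` helper are assumptions for illustration.

```python
# Hypothetical transaction log: each basket is a set of item names.
transactions = [
    {"laptop", "mouse"},
    {"laptop", "mouse", "bag"},
    {"laptop"},
    {"mouse"},
    {"bag"},
    {"laptop", "mouse"},
]

def support(itemset):
    """Fraction of transactions that contain every item in itemset."""
    hits = sum(1 for basket in transactions if itemset <= basket)
    return hits / len(transactions)

antecedent = {"laptop"}
consequent = {"mouse"}

# Confidence: how often the consequent appears when the antecedent does.
confidence = support(antecedent | consequent) / support(antecedent)
# Lift: confidence relative to the consequent's baseline frequency.
lift = confidence / support(consequent)

print(support(antecedent | consequent), confidence, lift)
```

A lift above 1 suggests the two items co-occur more often than they would if purchases were independent.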
28. When building an unsupervised anomaly detection system for credit card fraud, the model learns the distribution of normal (legitimate) transactions. How does it identify a potentially fraudulent transaction?
Anomaly & fraud detection
Medium
A.By flagging transactions that fall into low-density regions of the learned probability distribution.
B.By classifying the transaction using a decision tree trained on labeled examples of past fraud.
C.By matching the transaction against a hard-coded database of known fraudulent IP addresses.
D.By looking up the user's past history of reported frauds.
Correct Answer: By flagging transactions that fall into low-density regions of the learned probability distribution.
Explanation:
Unsupervised anomaly detection models establish what 'normal' data looks like by estimating its density. Data points (transactions) that occur in very low-density regions deviate significantly from the norm and are flagged as anomalies.
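One simple way to picture this is a single-feature Gaussian density model. The sketch below assumes transaction amounts are roughly Gaussian, fits a mean and standard deviation to made-up "normal" amounts, and flags anything whose density falls below the density at three standard deviations. The data, the threshold choice, and the function names are all assumptions; real systems use richer features and models.

```python
import math
import statistics

def gaussian_pdf(x, mean, std):
    """Density of a 1D Gaussian with the given mean and standard deviation."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2 * math.pi))

# Hypothetical amounts from ordinary transactions (no fraud labels used).
normal_amounts = [20, 25, 22, 30, 27, 24, 26, 23, 28, 25]
mean = statistics.mean(normal_amounts)
std = statistics.stdev(normal_amounts)

# Assumed cutoff: the density at three standard deviations from the mean.
threshold = gaussian_pdf(mean + 3 * std, mean, std)

def is_anomaly(amount):
    """Flag amounts that fall in a low-density region of the fitted model."""
    return gaussian_pdf(amount, mean, std) < threshold

print(is_anomaly(26))   # typical amount
print(is_anomaly(500))  # far outside the learned range
```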
29. What is a significant limitation of using strictly unsupervised learning for anomaly detection in a network intrusion system?
Anomaly & fraud detection
Medium
A.It often suffers from high false positive rates because novel but benign network behaviors may be flagged as anomalies.
B.It is unable to process high-dimensional network data.
C.It requires a massive, perfectly balanced dataset of both normal and attack traffic.
Correct Answer: It often suffers from high false positive rates because novel but benign network behaviors may be flagged as anomalies.
Explanation:
Unsupervised anomaly detection flags anything unusual as an anomaly. Consequently, perfectly safe but rare or new activities (like a sudden spike in legitimate traffic due to a viral event) are often incorrectly flagged, causing false positives.
30. Biologists are analyzing gene expression data (RNA-Seq) across different tissue samples. How can unsupervised learning help them discover new biological patterns?
Pattern discovery in biological and social data
Medium
A.By classifying the tissue samples as 'diseased' or 'healthy' using prior clinical labels.
B.By predicting the exact protein structure of a gene based on known homologous structures.
C.By grouping genes that exhibit similar expression profiles across samples, potentially revealing co-regulated gene networks.
D.By determining the precise mutation rate of the DNA sequence over time.
Correct Answer: By grouping genes that exhibit similar expression profiles across samples, potentially revealing co-regulated gene networks.
Explanation:
Clustering gene expression data is a classic unsupervised pattern discovery technique. Genes that behave similarly across conditions are often involved in the same biological pathways or are co-regulated.
31. In social network analysis, researchers apply a community detection algorithm to a graph of user interactions. What is the fundamental assumption underlying this unsupervised approach?
Pattern discovery in biological and social data
Medium
A.Every user must belong to exactly one perfectly sized community.
B.The distance between users is strictly proportional to their geographical distance.
C.Users within a community will have more connections to each other than they do to users outside the community.
D.The algorithm requires labeled examples of 'influencer' users to seed the communities.
Correct Answer: Users within a community will have more connections to each other than they do to users outside the community.
Explanation:
Community detection in graphs (a form of clustering) assumes that networks naturally compartmentalize into densely connected subgroups (communities) that are sparsely connected to other subgroups.
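The assumption can be checked on a toy graph by counting edges inside and between communities. The sketch below does not detect communities; it only evaluates a given assignment. The graph, the assignment, and the name `edge_counts` are assumptions for illustration.

```python
def edge_counts(edges, community_of):
    """Count edges within a community versus edges crossing communities."""
    intra = 0
    inter = 0
    for u, v in edges:
        if community_of[u] == community_of[v]:
            intra += 1
        else:
            inter += 1
    return intra, inter

# Hypothetical interaction graph: two triangles joined by one bridge edge.
edges = [
    ("a", "b"), ("a", "c"), ("b", "c"),
    ("d", "e"), ("d", "f"), ("e", "f"),
    ("c", "d"),
]
community_of = {"a": 0, "b": 0, "c": 0, "d": 1, "e": 1, "f": 1}

intra_edges, inter_edges = edge_counts(edges, community_of)
print(intra_edges, inter_edges)
```

A good community assignment leaves many edges inside communities and few crossing between them, which is the pattern community detection algorithms search for.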
32. Consider two data points in a 2D space, $A$ and $B$, whose coordinates differ by 3 along one axis and by 4 along the other. What are the Euclidean distance and the Manhattan distance between point $A$ and point $B$, respectively?
Distance & similarity metrics: Euclidean, Manhattan, Cosine similarity
Medium
A.Euclidean: 5, Manhattan: 5
B.Euclidean: 25, Manhattan: 7
C.Euclidean: 5, Manhattan: 7
D.Euclidean: 7, Manhattan: 5
Correct Answer: Euclidean: 5, Manhattan: 7
Explanation:
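The coordinate differences are 3 and 4, so the Euclidean distance is $\sqrt{3^2 + 4^2} = \sqrt{25} = 5$ and the Manhattan distance is $|3| + |4| = 7$.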
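The same arithmetic can be checked in a few lines. The points below are hypothetical, chosen only because their coordinates differ by 3 and 4; any such pair gives the same two distances.

```python
import math

def euclidean(p, q):
    """Straight-line (L2) distance between two points."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def manhattan(p, q):
    """Sum of absolute coordinate differences (L1 distance)."""
    return sum(abs(a - b) for a, b in zip(p, q))

# Assumed example points with coordinate differences of 3 and 4.
point_a = (0, 0)
point_b = (3, 4)

print(euclidean(point_a, point_b))  # 5.0
print(manhattan(point_a, point_b))  # 7
```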
33. Which of the following scenarios best justifies the use of Cosine similarity over Euclidean distance?
Distance & similarity metrics: Euclidean, Manhattan, Cosine similarity
Medium
A.When the dataset consists of categorical variables that have been one-hot encoded, and exact matches matter most.
B.When comparing documents represented by TF-IDF vectors, where the length of the document should not heavily influence the similarity.
C.When trying to find the shortest driving path in a city grid.
D.When clustering locations based on GPS coordinates where absolute physical distance is important.
Correct Answer: When comparing documents represented by TF-IDF vectors, where the length of the document should not heavily influence the similarity.
Explanation:
Cosine similarity measures the angle between two vectors, making it invariant to magnitude. This is ideal for text data, where a long document and a short document may discuss the exact same topic yet lie far apart in Euclidean distance because of their different lengths.
34. Given two non-zero vectors $A$ and $B$ in $\mathbb{R}^n$, under what condition will the Cosine similarity between them be equal to $0$?
Distance & similarity metrics: Euclidean, Manhattan, Cosine similarity
Medium
A.When $A$ and $B$ point in exactly opposite directions.
B.When $A$ and $B$ are identical.
C.When $A$ and $B$ are orthogonal (perpendicular) to each other.
D.When the Euclidean distance between $A$ and $B$ is exactly $1$.
Correct Answer: When $A$ and $B$ are orthogonal (perpendicular) to each other.
Explanation:
Cosine similarity is defined as $\cos(\theta) = \frac{A \cdot B}{\|A\| \, \|B\|}$. If the vectors are orthogonal, their dot product is $0$, resulting in a Cosine similarity of $0$. This corresponds to an angle of 90 degrees.
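A short numeric check of the three landmark cases (orthogonal, same direction, opposite direction) is sketched below. The vectors and the function name `cosine_similarity` are assumptions for illustration.

```python
import math

def cosine_similarity(u, v):
    """Dot product of u and v divided by the product of their norms."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Assumed example vectors in 2D.
orthogonal = cosine_similarity((1, 0), (0, 1))      # 90 degrees apart
same_direction = cosine_similarity((1, 2), (2, 4))  # 0 degrees apart
opposite = cosine_similarity((1, 2), (-1, -2))      # 180 degrees apart

print(orthogonal, same_direction, opposite)  # approximately 0, 1, -1
```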
35. A dataset has two features: Age (ranging from 18 to 80) and Annual Income (ranging from $20,000 to $150,000). If a clustering algorithm uses Euclidean distance on the raw, unscaled data, what is the most likely outcome?
Distance & similarity metrics: Euclidean, Manhattan, Cosine similarity
Medium
A.The Annual Income feature will dominate the distance calculations, making Age almost irrelevant to cluster formation.
B.The algorithm will fail to compute the distance because the units (years vs. dollars) are different.
C.The Age feature will dominate the distance calculations because smaller numbers are squared.
D.The algorithm will naturally balance both features because Euclidean distance is scale-invariant.
Correct Answer: The Annual Income feature will dominate the distance calculations, making Age almost irrelevant to cluster formation.
Explanation:
Euclidean distance is highly sensitive to the scale of the features. Because the differences in Income will be in the tens of thousands while Age differences are at most 62, the squared differences in Income will completely overwhelm the Age differences unless the data is normalized.
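The effect is easy to reproduce. In the sketch below, three made-up customers are compared before and after min-max scaling to the feature ranges stated in the question; the customer values and the helper names are assumptions for illustration.

```python
import math

def euclidean(p, q):
    """Straight-line distance between two feature vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def min_max_scale(customer):
    """Rescale (age, income) to [0, 1] using the ranges from the question."""
    age, income = customer
    return ((age - 18) / (80 - 18), (income - 20_000) / (150_000 - 20_000))

# Hypothetical customers as (age, annual income).
young = (25, 50_000)
old_similar_income = (75, 51_000)    # 50 years older, income $1,000 higher
young_higher_income = (26, 60_000)   # 1 year older, income $10,000 higher

raw_to_old = euclidean(young, old_similar_income)
raw_to_young = euclidean(young, young_higher_income)

scaled_to_old = euclidean(min_max_scale(young), min_max_scale(old_similar_income))
scaled_to_young = euclidean(min_max_scale(young), min_max_scale(young_higher_income))

print(raw_to_old, raw_to_young)        # income differences decide the ranking
print(scaled_to_old, scaled_to_young)  # age now contributes meaningfully
```

On raw values the 50-year age gap is almost invisible next to the income gap, so the much older customer looks closer. After scaling, the ranking flips.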
36. Why might a data scientist choose Manhattan distance ($L_1$ norm) over Euclidean distance ($L_2$ norm) when dealing with a high-dimensional dataset that contains several extreme outliers?
when and why distance choice matters
Medium
A.Because Manhattan distance uses absolute differences, making it less sensitive to the large deviations caused by extreme outliers compared to the squared differences of Euclidean distance.
B.Because Manhattan distance inherently reduces the dimensionality of the dataset by ignoring zero-valued features.
C.Because Manhattan distance squares the differences, penalizing outliers more heavily and separating them into their own clusters.
D.Because Manhattan distance always guarantees that the clustering algorithm will converge to a global optimum.
Correct Answer: Because Manhattan distance uses absolute differences, making it less sensitive to the large deviations caused by extreme outliers compared to the squared differences of Euclidean distance.
Explanation:
Euclidean distance squares the difference between coordinates. If there is an outlier, the squared difference becomes extremely large, heavily skewing the results. Manhattan distance only takes the absolute difference, providing more robust behavior in the presence of outliers.
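A small comparison shows how the two metrics can rank neighbors differently when one coordinate contains a large deviation. The vectors below are made up for illustration.

```python
import math

def euclidean(p, q):
    """L2 distance: square root of summed squared differences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def manhattan(p, q):
    """L1 distance: sum of absolute differences."""
    return sum(abs(a - b) for a, b in zip(p, q))

# Assumed example vectors in 4 dimensions.
reference = (0, 0, 0, 0)
moderate_everywhere = (3, 3, 3, 3)  # modest difference in every feature
single_outlier = (0, 0, 0, 10)      # identical except one extreme feature

print(manhattan(reference, moderate_everywhere), manhattan(reference, single_outlier))
print(euclidean(reference, moderate_everywhere), euclidean(reference, single_outlier))
```

Manhattan distance rates the single-outlier vector as the nearer one (10 versus 12), while Euclidean distance rates it as farther (10 versus 6) because squaring amplifies the one large deviation.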
37. In a recommendation system, you want to cluster users based on their ratings of movies. The data is highly sparse because most users have only rated a few movies out of thousands. Which distance/similarity measure is generally most effective here, and why?
when and why distance choice matters
Medium
A.Minkowski distance with $p = \infty$, because it isolates the single largest difference in movie ratings between two users.
B.Manhattan distance, because it evaluates the exact number of rating differences step-by-step.
C.Cosine similarity, because it focuses on the overlapping non-zero ratings and ignores the massive number of mutual zeros.
D.Euclidean distance, because it treats all unrated movies as zeros and directly computes the spatial distance.
Correct Answer: Cosine similarity, because it focuses on the overlapping non-zero ratings and ignores the massive number of mutual zeros.
Explanation:
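In high-dimensional sparse data, two vectors might have thousands of matching zeros (unrated movies) and only a few overlapping ratings. Euclidean and Manhattan distances are driven by rating magnitudes and by movies that only one of the two users has rated, so sparsity skews them. The dot product in Cosine similarity receives contributions only from dimensions where both vectors are non-zero, and the normalization removes the effect of vector magnitude, making it highly effective for sparse data.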
38. A routing algorithm for automated delivery drones must navigate a warehouse composed of strictly arranged parallel and perpendicular aisles. Which distance metric mathematically represents the true navigation path of the drone?
when and why distance choice matters
Medium
A.Euclidean distance
B.Cosine similarity
C.Manhattan distance
D.Mahalanobis distance
Correct Answer: Manhattan distance
Explanation:
Manhattan distance (or 'taxicab geometry') measures distance by summing the absolute differences of their coordinates. This perfectly models navigation restricted to parallel and perpendicular grid axes, like aisles in a warehouse.
39. Assume you are clustering data using K-Means, which traditionally minimizes the within-cluster sum of squares (Euclidean distance). If you change the underlying objective to minimize the sum of absolute differences (Manhattan distance), forming the K-Medians algorithm, how does the geometric shape of the theoretical cluster boundaries shift?
when and why distance choice matters
Medium
A.The boundaries shift from spherical/circular curves to piece-wise linear, axis-aligned (diamond-like) shapes.
B.The boundaries shift from axis-aligned squares to perfectly spherical curves.
C.The boundaries remain completely unchanged; only the centroid locations differ.
D.The boundaries disappear completely, as Manhattan distance cannot form enclosed regions.
Correct Answer: The boundaries shift from spherical/circular curves to piece-wise linear, axis-aligned (diamond-like) shapes.
Explanation:
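Isolines (lines of equal distance from a center) for Euclidean distance are circles/spheres. For Manhattan distance, the isolines form diamonds (squares rotated by 45 degrees). This changes the geometric shape of the boundaries separating the clusters. Strictly speaking, the boundary between two Euclidean centroids is a straight perpendicular bisector, so "spherical" describes the equal-distance contours around each centroid; under Manhattan distance the bisector becomes piecewise linear.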
40. Suppose you are comparing two sets of vectors. Set A has dense, normally distributed continuous features. Set B consists of binary feature vectors where a 1 indicates the presence of a trait. Why does the choice of distance matter fundamentally between Set A and Set B?
when and why distance choice matters
Medium
A.Set B requires a distance metric that evaluates magnitudes (like Euclidean), whereas Set A requires metrics focusing on logical overlaps (like Jaccard).
B.Set B's binary nature means metrics like Jaccard or Hamming distance provide meaningful interpretations of trait overlap, whereas Euclidean is better suited for Set A's continuous magnitudes.
C.Set A represents categorical data better than Set B, requiring Cosine similarity.
D.The choice does not matter; Euclidean distance is universally optimal regardless of data distribution or type.
Correct Answer: Set B's binary nature means metrics like Jaccard or Hamming distance provide meaningful interpretations of trait overlap, whereas Euclidean is better suited for Set A's continuous magnitudes.
Explanation:
The nature of the data dictates the metric. For binary presence/absence data (Set B), measuring overlaps or mismatches (Jaccard, Hamming) is intuitive. For continuous, normally distributed variables (Set A), geometric distances like Euclidean are mathematically appropriate.
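For binary trait vectors, both measures reduce to counting positions. The sketch below uses two made-up presence/absence vectors; the function names are assumptions for illustration.

```python
def jaccard_similarity(u, v):
    """Shared traits divided by traits present in at least one vector."""
    both = sum(1 for a, b in zip(u, v) if a == 1 and b == 1)
    either = sum(1 for a, b in zip(u, v) if a == 1 or b == 1)
    return both / either

def hamming_distance(u, v):
    """Number of positions where the two vectors disagree."""
    return sum(1 for a, b in zip(u, v) if a != b)

# Hypothetical trait vectors: 1 means the trait is present.
item_a = [1, 1, 0, 0, 1, 0]
item_b = [1, 0, 0, 0, 1, 1]

print(jaccard_similarity(item_a, item_b))  # 2 shared traits out of 4 present
print(hamming_distance(item_a, item_b))    # 2 mismatching positions
```

Jaccard ignores positions where both vectors are 0, whereas Hamming treats a shared absence as agreement, so the choice depends on whether shared absences are meaningful.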
41. In semi-supervised learning, algorithms often rely on specific structural assumptions about the underlying data distribution to leverage unlabeled data effectively. Which of the following describes a scenario where applying semi-supervised learning is highly likely to degrade model performance compared to purely supervised learning?
Supervised vs. Unsupervised vs. Semi-Supervised Learning
Hard
A.The data strictly follows the manifold assumption, where high-dimensional data lies on a lower-dimensional structure.
B.The marginal distribution $P(x)$ is uniform and provides no information about the conditional distribution $P(y \mid x)$.
C.The unlabeled data contains a high degree of missing values that are Missing Completely At Random (MCAR).
D.The cluster assumption holds, meaning points in the same dense region share the same class label.
Correct Answer: The marginal distribution $P(x)$ is uniform and provides no information about the conditional distribution $P(y \mid x)$.
Explanation:
Semi-supervised learning works on the premise that $P(x)$ carries information about $P(y \mid x)$ (e.g., through cluster or manifold assumptions). If $P(x)$ and $P(y \mid x)$ are entirely independent or $P(x)$ provides misleading structural clues, incorporating unlabeled data can mislead the model, degrading performance.
42. Consider a Positive-Unlabeled (PU) learning scenario formulated to find anomalies. Under what condition can PU learning be theoretically reduced to a standard semi-supervised learning problem?
Supervised vs. Unsupervised vs. Semi-Supervised Learning
Hard
A.When the negative class heavily outnumbers the positive class in the unlabeled dataset.
B.When the unlabeled set is drawn from the exact same marginal distribution as the test data.
C.PU learning cannot be reduced to semi-supervised learning because the absence of labeled negatives strictly defines it as unsupervised density estimation.
D.When the Selected Completely At Random (SCAR) assumption holds, meaning labeled positives are a uniform random sample of all true positives.
Correct Answer: When the Selected Completely At Random (SCAR) assumption holds, meaning labeled positives are a uniform random sample of all true positives.
Explanation:
Under the SCAR assumption, the probability that a positive example is labeled is a constant (the label frequency). Estimating this constant, or equivalently the class prior, allows the formulation of risk estimators that transform the PU learning problem into a standard binary classification (and by extension, semi-supervised) framework.
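The reduction can be written in one line. As a sketch, with notation that is an assumption here rather than the quiz's own: let $s \in \{0, 1\}$ indicate whether an example is labeled, $y$ its true class, and $c = P(s = 1 \mid y = 1)$ the constant label frequency. Because only positives are ever labeled, SCAR gives

$$
P(s = 1 \mid x) = P(s = 1 \mid y = 1, x)\, P(y = 1 \mid x) = c \, P(y = 1 \mid x)
$$

so a classifier trained to separate labeled from unlabeled examples estimates $P(y = 1 \mid x)$ up to the factor $c$.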
43Transductive learning is often contrasted with inductive semi-supervised learning. Which of the following is a strict mathematical limitation of transductive Support Vector Machines (TSVMs) applied to a clustering-like formulation?
Supervised vs. Unsupervised vs. Semi-Supervised Learning
Hard
A.They can only operate using linear kernels because the manifold assumption is violated in infinite-dimensional spaces.
B.They require the unlabeled data to be mapped to a strictly orthogonal feature space relative to the labeled data.
C.They optimize the margin over both labeled and unlabeled data but fail to produce a global generalization function for unseen data out-of-sample.
D.They inherently assume that the unlabeled data follows a Gaussian distribution, limiting their use in non-parametric setups.
Correct Answer: They optimize the margin over both labeled and unlabeled data but fail to produce a global generalization function for unseen data out-of-sample.
Explanation:
Transductive learning algorithms, unlike inductive ones, are designed to make predictions only on the specific, provided unlabeled instances (the working set). They do not infer a general hypothesis or mapping function that can be applied to new, unseen data.
44In latent variable models for unsupervised learning, the goal is often to maximize the log-likelihood $\log p(x)$ of the observed data $x$. Because the true posterior $p(z \mid x)$ of the latent variables $z$ is often intractable, variational inference is used. Which inequality forms the fundamental basis for this problem formulation?
Problem formulation for unsupervised learning
Hard
A.Markov's Inequality
B.Cauchy-Schwarz Inequality
C.Chebyshev's Inequality
D.Jensen's Inequality
Correct Answer: Jensen's Inequality
Explanation:
Variational inference optimizes the Evidence Lower Bound (ELBO). The derivation of the ELBO from the log marginal likelihood relies fundamentally on Jensen's Inequality, utilizing the concavity of the logarithm function to establish a lower bound.
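The bound can be derived in a few steps. A sketch, assuming $q(z)$ denotes any variational distribution over the latent variables $z$ (notation introduced here for illustration):

$$
\log p(x) = \log \int q(z)\,\frac{p(x, z)}{q(z)}\,dz = \log \mathbb{E}_{q(z)}\!\left[\frac{p(x, z)}{q(z)}\right] \ge \mathbb{E}_{q(z)}\!\left[\log \frac{p(x, z)}{q(z)}\right] = \mathrm{ELBO}(q)
$$

The inequality is Jensen's, applied to the concave logarithm. The gap between the two sides equals $\mathrm{KL}\big(q(z) \,\|\, p(z \mid x)\big)$, so the bound is tight exactly when $q$ matches the true posterior.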
45When formulating a centroid-based clustering algorithm (like K-Means) as an optimization problem, it is mathematically posed as minimizing the within-cluster sum of squares (WCSS). What makes finding the exact global minimum of this formulation computationally prohibitive?
Problem formulation for unsupervised learning
Hard
A.The problem requires computing the pseudo-inverse of a singular covariance matrix at each iteration.
B.The objective function is strictly convex but contains non-differentiable points at the cluster boundaries.
C.The optimization is fundamentally ill-posed because distance metrics fail to satisfy the triangle inequality in Euclidean space.
D.The search space involves combinatorial assignments of points to clusters, making the problem NP-hard in general (even for $k = 2$ or $d = 2$).
Correct Answer: The search space involves combinatorial assignments of points to clusters, making the problem NP-hard in general (even for $k = 2$ or $d = 2$).
Explanation:
K-means clustering is a mixed-integer optimization problem. While the centroid updates are continuous, the point assignments are discrete and combinatorial. It has been proven to be NP-hard in general Euclidean space, even for $k = 2$ clusters (in arbitrary dimension) and even for $d = 2$ dimensions (with arbitrary $k$), hence the use of heuristic algorithms like Lloyd's, which only guarantee a local minimum.
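To get a feel for the size of the assignment space, the following sketch counts the ways to split $n$ labeled points into $k$ non-empty clusters (the Stirling numbers of the second kind). The function name `stirling2` is chosen here for illustration.

```python
def stirling2(n: int, k: int) -> int:
    """Number of ways to partition n labeled points into k non-empty clusters."""
    # table[i][j] = S(i, j), filled with the recurrence
    # S(i, j) = j * S(i - 1, j) + S(i - 1, j - 1).
    table = [[0] * (k + 1) for _ in range(n + 1)]
    table[0][0] = 1
    for i in range(1, n + 1):
        for j in range(1, min(i, k) + 1):
            table[i][j] = j * table[i - 1][j] + table[i - 1][j - 1]
    return table[n][k]


# The count grows roughly like k**n / k!, so exhaustive search is hopeless
# long before n reaches realistic dataset sizes.
for n in (10, 20, 50):
    print(n, stirling2(n, 3))
```

An exact solver would have to rule out every one of these partitions, which is why practical implementations settle for a local optimum and restart from several initializations.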
46An unsupervised anomaly detection system minimizes a reconstruction error function (e.g., the squared error $\lVert x - \hat{x} \rVert^2$ between an input $x$ and its reconstruction $\hat{x}$). If the model is an undercomplete linear autoencoder (equivalent to PCA), which of the following best describes the subspace onto which the encoder projects the data to formulate the 'normal' profile?
Problem formulation for unsupervised learning
Hard
A.The subspace spanned by the eigenvectors corresponding to the largest eigenvalues of the data's covariance matrix.
B.The subspace spanned by the eigenvectors corresponding to the smallest eigenvalues of the data's covariance matrix.
C.A non-linear manifold mapping defined by the kernel trick applied to the covariance matrix.
D.The subspace orthogonal to the principal components, capturing maximum data variance.
Correct Answer: The subspace spanned by the eigenvectors corresponding to the largest eigenvalues of the data's covariance matrix.
Explanation:
An undercomplete linear autoencoder learns to project data onto the principal subspace, which is spanned by the principal components (eigenvectors with the largest eigenvalues). Anomalies are detected by their high reconstruction error when projected back from this subspace.
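A minimal two-dimensional sketch of this idea, using only the standard library. The data, the helper names, and the closed-form 2x2 eigen-decomposition are illustrative assumptions; a real system would use a linear algebra library and more than one component.

```python
import math


def principal_direction(points):
    """Mean and unit eigenvector (largest eigenvalue) of the 2x2 covariance."""
    n = len(points)
    mx = sum(p[0] for p in points) / n
    my = sum(p[1] for p in points) / n
    sxx = sum((p[0] - mx) ** 2 for p in points) / n
    syy = sum((p[1] - my) ** 2 for p in points) / n
    sxy = sum((p[0] - mx) * (p[1] - my) for p in points) / n
    # Largest eigenvalue of [[sxx, sxy], [sxy, syy]] in closed form.
    lam = (sxx + syy) / 2 + math.sqrt(((sxx - syy) / 2) ** 2 + sxy ** 2)
    # (sxy, lam - sxx) is an eigenvector for lam; this assumes sxy != 0.
    vx, vy = sxy, lam - sxx
    norm = math.hypot(vx, vy)
    return (mx, my), (vx / norm, vy / norm)


def reconstruction_error(point, mean, direction):
    """Squared distance between a point and its projection on the 1-D subspace."""
    cx, cy = point[0] - mean[0], point[1] - mean[1]
    t = cx * direction[0] + cy * direction[1]    # 1-D code (the "encoding")
    rx, ry = t * direction[0], t * direction[1]  # decoded point, still centered
    return (cx - rx) ** 2 + (cy - ry) ** 2


# Hypothetical "normal" data lying close to the line y = x.
normal = [(float(t), t + 0.1 * (-1) ** t) for t in range(10)]
mean, direction = principal_direction(normal)

err_normal = reconstruction_error((4.0, 4.1), mean, direction)
err_anomaly = reconstruction_error((4.0, -4.0), mean, direction)
print(err_normal, err_anomaly)
```

The point far from the principal axis reconstructs poorly, which is the signal a threshold on reconstruction error picks up.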
47In anomaly detection, particularly in fraud detection systems using distance-based unsupervised methods, the phenomenon of 'swamping' occurs. How is 'swamping' mathematically defined or observed in this context?
Real-life use cases: Market segmentation, Customer behavior analysis, Anomaly & fraud detection, Pattern discovery in biological and social data
Hard
A.When distance metrics collapse in high dimensions, causing all pairwise distances between normal and anomalous points to become uniform.
B.When the presence of massive amounts of normal data masks the outliers, pushing the outlier score below the detection threshold.
C.When normal instances are incorrectly classified as anomalies because they are drawn into the sparse feature space heavily influenced by true outliers.
D.When anomalous points are clustered so tightly together that they artificially inflate the local density, appearing as a normal cluster.
Correct Answer: When normal instances are incorrectly classified as anomalies because they are drawn into the sparse feature space heavily influenced by true outliers.
Explanation:
Swamping occurs when normal observations are classified as outliers. This usually happens when a group of outliers skews the statistical boundaries of the 'normal' model, pulling normal points into the outlier region. Masking is the opposite, where outliers are missed.
48When applying unsupervised learning for pattern discovery in biological data (e.g., gene expression microarrays), 'biclustering' is often preferred over standard clustering. What specific problem formulation makes biclustering uniquely suited for this use case?
Real-life use cases: Market segmentation, Customer behavior analysis, Anomaly & fraud detection, Pattern discovery in biological and social data
Hard
A.It utilizes non-Euclidean distance metrics exclusively to account for non-linear gene interactions.
B.It forces the clusters into a hierarchical binary tree structure, aligning with phylogenetic evolutionary mapping.
C.It simultaneously clusters rows (genes) and columns (conditions), discovering genes that exhibit similar behavior only under a specific subset of conditions.
D.It automatically projects the data into a 2-dimensional latent space to handle the high sparsity of biological matrices.
Correct Answer: It simultaneously clusters rows (genes) and columns (conditions), discovering genes that exhibit similar behavior only under a specific subset of conditions.
Explanation:
Standard clustering groups genes based on their behavior across all conditions. Biclustering (or co-clustering) finds local patterns, identifying a subset of genes that are co-expressed only under a specific subset of conditions, which is crucial in biology as genes are often only active in specific environments.
49A bank uses K-means for market segmentation based on continuous customer behavioral features. To evaluate the quality of the unsupervised segmentation, the data science team uses the Silhouette Coefficient. In which of the following edge cases will the Silhouette Coefficient misleadingly report a near-zero or negative score despite the clusters being perfectly separated?
Real-life use cases: Market segmentation, Customer behavior analysis, Anomaly & fraud detection, Pattern discovery in biological and social data
Hard
A.When the clusters are densely packed spherical distributions with identical variances.
B.When the clusters form concentric rings (e.g., non-convex geometries) and rely on density-based separation.
C.When the number of features (dimensions) far exceeds the number of observations ($d \gg n$).
D.When all features have been rigorously standardized to have a mean of 0 and a variance of 1.
Correct Answer: When the clusters form concentric rings (e.g., non-convex geometries) and rely on density-based separation.
Explanation:
The Silhouette Coefficient is computed from pairwise distances (typically Euclidean), comparing each point's mean distance to its own cluster with its mean distance to the nearest other cluster. It therefore implicitly favors convex, roughly globular clusters. If clusters are non-convex (like concentric rings), a point's mean distance to its own cluster can exceed its mean distance to the other cluster, leading to poor silhouette scores even if the clusters are distinct and properly separated by a density-based algorithm.
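The effect is easy to reproduce. The sketch below builds two concentric rings (radii and point counts are arbitrary choices for illustration) and computes the mean silhouette of each ring directly from the definition, without any clustering library.

```python
import math


def ring(radius, n):
    """n evenly spaced points on a circle of the given radius."""
    return [(radius * math.cos(2 * math.pi * i / n),
             radius * math.sin(2 * math.pi * i / n)) for i in range(n)]


def mean_dist(p, others):
    return sum(math.dist(p, q) for q in others) / len(others)


def silhouette(p, own, other):
    """Silhouette of p, where `own` is p's cluster (including p itself)."""
    a = mean_dist(p, [q for q in own if q is not p])  # cohesion
    b = mean_dist(p, other)                           # separation
    return (b - a) / max(a, b)


# Two clearly separated rings: neighbours on a ring are far closer to each
# other than to any point on the other ring.
inner, outer = ring(1.0, 60), ring(2.0, 60)
s_inner = sum(silhouette(p, inner, outer) for p in inner) / len(inner)
s_outer = sum(silhouette(p, outer, inner) for p in outer) / len(outer)
print(round(s_inner, 3), round(s_outer, 3))
```

Points on the outer ring are, on average, farther from the rest of their own ring than from the inner ring, so their silhouette is negative even though the labeling is exactly right.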
50In customer behavior analysis, sequential pattern mining (e.g., PrefixSpan) is used to find frequent subsequences of purchases. If a transaction dataset has a very high diversity of items but short transaction lengths, why might an unsupervised frequent itemset mining algorithm like Apriori fail computationally while a sequential pattern approach is required?
Real-life use cases: Market segmentation, Customer behavior analysis, Anomaly & fraud detection, Pattern discovery in biological and social data
Hard
A.Short transaction lengths inherently violate the anti-monotonicity property of the support measure.
B.Sequential pattern mining models strictly assume a Gaussian distribution of item frequencies.
C.Apriori cannot handle categorical variables without one-hot encoding, leading to memory overflow.
D.Apriori generates a massive number of unpruned candidate itemsets at lower levels before finding support, causing a combinatorial explosion.
Correct Answer: Apriori generates a massive number of unpruned candidate itemsets at lower levels before finding support, causing a combinatorial explosion.
Explanation:
Apriori uses a breadth-first generation of candidates. In highly diverse datasets, the number of frequent 1-itemsets and 2-itemsets is massive, causing a combinatorial explosion during candidate generation, even if transactions are short. Methods like FP-Growth or pattern-growth (PrefixSpan for sequences) avoid this by not generating explicit candidates.
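A rough sense of the scale involved: in the worst case every $k$-subset of the frequent items becomes a level-$k$ candidate. The sketch below computes that upper bound; the function name and the figure of 10,000 frequent items are hypothetical.

```python
from math import comb


def candidate_counts(num_frequent_items: int, max_len: int) -> dict:
    """Worst-case number of Apriori candidates per level, before support counting."""
    # Level k joins frequent (k-1)-itemsets; if all of them are frequent,
    # every k-subset of the frequent items has to be counted.
    return {k: comb(num_frequent_items, k) for k in range(2, max_len + 1)}


print(candidate_counts(10_000, 3))
```

Each candidate needs a support count against the transaction database, which is the cost that pattern-growth methods avoid.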
51When building an autoencoder for fraud detection on credit card transactions, the training dataset consists exclusively of 'normal' transactions. If the autoencoder is designed with a massive latent dimension (overcomplete) without proper regularization, what will be the expected outcome during inference on real-world data containing fraud?
Real-life use cases: Market segmentation, Customer behavior analysis, Anomaly & fraud detection, Pattern discovery in biological and social data
Hard
A.The model will naturally approximate a linear PCA projection, maintaining standard anomaly detection capabilities.
B.The model will enforce extreme sparsity, causing normal transactions to have higher reconstruction errors than fraudulent ones.
C.The model will achieve a near-zero reconstruction error for both normal and fraudulent transactions, failing to detect fraud.
D.The model will easily detect fraud because the reconstruction error for all data points will universally increase.
Correct Answer: The model will achieve a near-zero reconstruction error for both normal and fraudulent transactions, failing to detect fraud.
Explanation:
An overcomplete autoencoder without regularization (like sparsity constraints or denoising objectives) will simply learn the identity function. It will perfectly memorize and reconstruct any input passed to it, including anomalies (fraud). Consequently, the reconstruction error will be low for everything, destroying its utility as an anomaly detector.
52Let $x$ and $y$ be two real-valued feature vectors representing text documents. If $x$ and $y$ are both strictly $L_2$-normalized such that $\lVert x \rVert_2 = 1$ and $\lVert y \rVert_2 = 1$, what is the exact mathematical relationship between their squared Euclidean distance $\lVert x - y \rVert_2^2$ and their Cosine similarity $\cos(x, y)$?
Distance & similarity metrics: Euclidean, Manhattan, Cosine similarity, when and why distance choice matters
Hard
A.
B.
C.
D.
Correct Answer: $\lVert x - y \rVert_2^2 = 2 - 2\cos(x, y)$
Explanation:
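The squared Euclidean distance is $\lVert x - y \rVert_2^2 = \lVert x \rVert_2^2 + \lVert y \rVert_2^2 - 2\, x \cdot y$. Since both vectors are $L_2$-normalized, $\lVert x \rVert_2^2 = 1$ and $\lVert y \rVert_2^2 = 1$. The dot product $x \cdot y$ for unit vectors is exactly the Cosine similarity. Thus, $\lVert x - y \rVert_2^2 = 2 - 2\cos(x, y)$.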
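The identity can be checked numerically. A small sketch with arbitrary example vectors (the values and helper names are illustrative):

```python
import math


def l2_normalize(v):
    norm = math.sqrt(sum(a * a for a in v))
    return [a / norm for a in v]


def squared_euclidean(x, y):
    return sum((a - b) ** 2 for a, b in zip(x, y))


def cosine_similarity(x, y):
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return dot / (norm_x * norm_y)


x = l2_normalize([3.0, 1.0, 0.0, 2.0])
y = l2_normalize([1.0, 0.0, 4.0, 1.0])

lhs = squared_euclidean(x, y)
rhs = 2 - 2 * cosine_similarity(x, y)
print(lhs, rhs)  # the two values agree up to floating-point rounding
```

One practical consequence: on unit-normalized vectors, ranking neighbors by Euclidean distance and ranking them by cosine similarity give the same order.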
53According to the phenomenon associated with the 'curse of dimensionality' in distance metric spaces, as the dimensionality $d \to \infty$, what happens to the ratio $\frac{D_{\max} - D_{\min}}{D_{\min}}$ for a given query point (where $D_{\max}$ and $D_{\min}$ are the maximum and minimum distances to other points)?
Distance & similarity metrics: Euclidean, Manhattan, Cosine similarity, when and why distance choice matters
Hard
A.It oscillates unpredictably, which is why Cosine similarity is exclusively used in high dimensions.
B.It approaches $0$, meaning the distance to the nearest neighbor and the farthest neighbor become virtually indistinguishable.
C.It approaches $\infty$, meaning nearest neighbors become exponentially identifiable.
D.It converges to a non-zero constant dependent solely on the chosen norm.
Correct Answer: It approaches $0$, meaning the distance to the nearest neighbor and the farthest neighbor become virtually indistinguishable.
Explanation:
This is a statement of Beyer's theorem regarding distance concentration. As dimensions increase, under certain broad conditions, the relative difference between the farthest and nearest points vanishes (approaches 0). This causes distance-based unsupervised algorithms (like K-Means or KNN) to lose their discriminatory power.
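Distance concentration shows up in a small simulation. The sketch below assumes points drawn uniformly from the unit hypercube and a query point at the origin; the sample size and seed are arbitrary.

```python
import math
import random


def relative_contrast(dim, n_points=200, seed=0):
    """(D_max - D_min) / D_min from the origin to uniform random points."""
    rng = random.Random(seed)
    dists = [math.sqrt(sum(rng.random() ** 2 for _ in range(dim)))
             for _ in range(n_points)]
    return (max(dists) - min(dists)) / min(dists)


contrasts = {d: relative_contrast(d) for d in (2, 10, 100, 1000)}
for d, value in contrasts.items():
    print(d, round(value, 3))
```

The ratio shrinks steadily as the dimension grows, which is the behavior the theorem describes for this kind of i.i.d. data.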
54A data scientist is designing a custom clustering algorithm using 'Cosine Distance', defined as $d(x, y) = 1 - \cos(x, y)$. They intend to use a metric tree (e.g., Ball Tree) to speed up neighbor searches. Why will this approach fundamentally fail or yield incorrect optimizations?
Distance & similarity metrics: Euclidean, Manhattan, Cosine similarity, when and why distance choice matters
Hard
B.Cosine Distance is mathematically equivalent to the norm, rendering the spherical bounds of a Ball Tree inefficient.
C.Cosine Distance forces the metric tree to map all data into an infinite-dimensional Hilbert space.
D.Cosine Distance does not satisfy the triangle inequality, which is a strict requirement for metric trees.
Correct Answer: Cosine Distance does not satisfy the triangle inequality, which is a strict requirement for metric trees.
Explanation:
A true distance metric must satisfy non-negativity, identity of indiscernibles, symmetry, and the triangle inequality. Cosine Distance ($1 - \cos(x, y)$) does not satisfy the triangle inequality. Metric tree data structures rely entirely on the triangle inequality to prune search spaces; violating it leads to incorrect nearest-neighbor retrievals.
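A three-vector counterexample makes the violation concrete. The vectors below are chosen for illustration: two hops of 45 degrees come out "shorter" than the direct 90-degree step.

```python
import math


def cosine_distance(x, y):
    dot = sum(a * b for a, b in zip(x, y))
    return 1 - dot / (math.hypot(*x) * math.hypot(*y))


x, y, z = (1.0, 0.0), (1.0, 1.0), (0.0, 1.0)

direct = cosine_distance(x, z)                          # 90 degrees apart
detour = cosine_distance(x, y) + cosine_distance(y, z)  # two 45-degree hops
print(direct, detour)
```

Since `direct` exceeds `detour`, the triangle inequality fails for this triple. The angle itself (the arccosine of the similarity) does satisfy it, which is one common workaround when a true metric is required.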
55When performing clustering on high-dimensional data, researchers sometimes prefer fractional distance metrics ($L_p$ norm where $0 < p < 1$) over standard Euclidean ($L_2$) or Manhattan ($L_1$) metrics. What is the primary theoretical justification for this choice?
Distance & similarity metrics: Euclidean, Manhattan, Cosine similarity, when and why distance choice matters
Hard
A.Fractional norms are less sensitive to the 'distance concentration' effect, providing better relative contrast between near and far points in high dimensions.
B.Fractional norms are computationally cheaper to compute because they bypass the need for floating-point exponentiation.
C.Fractional norms implicitly perform feature scaling, removing the need for standardization prior to clustering.
D.Fractional norms guarantee the convexity of the cluster boundaries, ensuring global convergence of K-Means.
Correct Answer: Fractional norms are less sensitive to the 'distance concentration' effect, providing better relative contrast between near and far points in high dimensions.
Explanation:
Research (e.g., by Aggarwal et al.) shows that as dimensionality increases, the contrast between maximum and minimum distances degrades. Using fractional norms ($0 < p < 1$) mitigates this effect better than $L_1$ or $L_2$, providing more meaningful distance measurements in highly dimensional, sparse spaces.
56In a dataset with highly correlated features with differing variances, standard Euclidean distance often creates skewed, elongated clusters. A common mitigation is to use the Mahalanobis distance. Which of the following data preprocessing steps followed by standard Euclidean distance is mathematically equivalent to computing the Mahalanobis distance on the original data?
Distance & similarity metrics: Euclidean, Manhattan, Cosine similarity, when and why distance choice matters
Hard
A.Applying an $L_2$ vector normalization to all data points to project them onto a unit hypersphere.
B.Min-Max scaling all features to the range $[0, 1]$.
C.Applying independent Z-score normalization (standardization) to each feature individually.
D.Transforming the data using Principal Component Analysis (PCA) and then dividing each principal component by its standard deviation (whitening).
Correct Answer: Transforming the data using Principal Component Analysis (PCA) and then dividing each principal component by its standard deviation (whitening).
Explanation:
Mahalanobis distance accounts for both the variance of each variable and the covariance between variables. PCA rotates the data to remove covariance (orthogonalizing the axes), and dividing by the standard deviation (whitening) scales the variances to 1. Euclidean distance in this whitened space is exactly equivalent to Mahalanobis distance in the original space.
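The equivalence can be verified in two dimensions with closed-form linear algebra. In this sketch the covariance entries `a`, `b`, `c` and the two test points are hypothetical, and the eigenvector formula assumes the off-diagonal entry is non-zero.

```python
import math

# Hypothetical 2x2 covariance matrix [[a, b], [b, c]] with correlated features.
a, b, c = 4.0, 1.5, 1.0


def mahalanobis(u, v):
    """sqrt((u - v)^T Sigma^{-1} (u - v)) using the closed-form 2x2 inverse."""
    dx, dy = u[0] - v[0], u[1] - v[1]
    det = a * c - b * b
    return math.sqrt((c * dx * dx - 2 * b * dx * dy + a * dy * dy) / det)


def pca_whiten(p):
    """Rotate onto the eigenvectors of Sigma, then divide by sqrt(eigenvalue)."""
    half_trace = (a + c) / 2
    root = math.sqrt(((a - c) / 2) ** 2 + b * b)
    out = []
    for lam in (half_trace + root, half_trace - root):
        vx, vy = b, lam - a              # eigenvector for lam (needs b != 0)
        norm = math.hypot(vx, vy)
        out.append((p[0] * vx + p[1] * vy) / (norm * math.sqrt(lam)))
    return out


u, v = (2.0, 1.0), (-1.0, 0.5)
d_mahal = mahalanobis(u, v)
d_white = math.dist(pca_whiten(u), pca_whiten(v))
print(d_mahal, d_white)  # equal up to floating-point rounding
```

Mean-centering is omitted because it cancels when two points are subtracted. Per-feature standardization alone would not reproduce this result, since it leaves the correlation between features in place.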
57Consider the Minkowski distance metric $D(x, y) = \left( \sum_{i=1}^{n} |x_i - y_i|^p \right)^{1/p}$. As the parameter $p$ approaches infinity ($p \to \infty$), which widely known distance metric does this equation mathematically converge to?
Distance & similarity metrics: Euclidean, Manhattan, Cosine similarity, when and why distance choice matters
Hard
A.Mahalanobis distance
B.Chebyshev distance ($L_\infty$ norm)
C.Manhattan distance ($L_1$ norm)
D.Cosine distance
Correct Answer: Chebyshev distance ($L_\infty$ norm)
Explanation:
As $p \to \infty$, the Minkowski distance becomes dominated entirely by the single dimension with the maximum absolute difference between the two vectors. This is the definition of the Chebyshev distance, also known as the $L_\infty$ norm or maximum metric: $D_\infty(x, y) = \max_i |x_i - y_i|$.
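The convergence is visible after only a few values of $p$. A short sketch with arbitrary example vectors:

```python
def minkowski(x, y, p):
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1 / p)


def chebyshev(x, y):
    return max(abs(a - b) for a, b in zip(x, y))


x, y = (1.0, 5.0, 2.0), (4.0, 1.0, 2.0)  # coordinate differences: 3, 4, 0
for p in (1, 2, 10, 100):
    print(p, minkowski(x, y, p))
print("chebyshev", chebyshev(x, y))
```

As `p` grows, the largest coordinate difference dominates the sum and the printed values settle on the Chebyshev distance.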
58Why is Cosine similarity typically preferred over Euclidean distance when analyzing document similarity using TF-IDF (Term Frequency-Inverse Document Frequency) vectors?
Distance & similarity metrics: Euclidean, Manhattan, Cosine similarity, when and why distance choice matters
Hard
A.Because Euclidean distance is invalid for categorical distributions, whereas Cosine similarity implicitly performs a cross-entropy calculation.
B.Because TF-IDF requires non-negative distances, and Cosine similarity ensures distances strictly remain between 0 and 1, unlike Euclidean.
C.Because Euclidean distance is highly sensitive to the magnitude of the vectors, meaning a long document and a short document with the same topic distribution would appear artificially distant.
D.Because TF-IDF vectors are inherently dense, and Cosine similarity executes faster on dense matrix multiplications than Euclidean distance.
Correct Answer: Because Euclidean distance is highly sensitive to the magnitude of the vectors, meaning a long document and a short document with the same topic distribution would appear artificially distant.
Explanation:
In text mining, document length translates to vector magnitude in TF-IDF space. Two documents discussing the exact same topic but differing greatly in length (e.g., a summary vs. a full book) will have a very large Euclidean distance. Cosine similarity measures the angle between vectors, making it scale-invariant and focused entirely on the orientation (topic distribution).
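A toy illustration of the length effect. The weights below are hypothetical, and the long document is modeled simply as a scaled copy of the short one, which is a simplification of how TF-IDF weights actually grow with length.

```python
import math

short_doc = [1.0, 2.0, 0.0, 1.0]        # hypothetical term weights
long_doc = [50 * w for w in short_doc]  # same topic mix, 50x the magnitude

euclidean = math.dist(short_doc, long_doc)
dot = sum(a * b for a, b in zip(short_doc, long_doc))
cosine = dot / (math.hypot(*short_doc) * math.hypot(*long_doc))
print(euclidean, cosine)
```

The Euclidean distance is large purely because of magnitude, while the cosine similarity is 1 up to rounding because the two vectors point in the same direction.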
59If an unsupervised algorithm relies on updating cluster centroids using the arithmetic mean of the assigned points, but uses Cosine similarity for assignments (e.g., Spherical K-Means), what critical adjustment must be made to the centroid update step to maintain mathematical consistency?
Distance & similarity metrics: Euclidean, Manhattan, Cosine similarity, when and why distance choice matters
Hard
A.The centroids must be updated using the geometric mean rather than the arithmetic mean.
B.The centroids must be defined using the median of the vectors rather than the mean to minimize angular distance.
C.The centroids must be shifted by a constant factor of to account for angular variance.
D.The computed arithmetic mean centroid must be $L_2$-normalized to project it back onto the unit hypersphere.
Correct Answer: The computed arithmetic mean centroid must be $L_2$-normalized to project it back onto the unit hypersphere.
Explanation:
Spherical K-Means uses Cosine similarity, which implies operating on directional data (points on a unit hypersphere). The arithmetic mean of a set of unit vectors lies inside the hypersphere (its norm is less than 1). To serve as a valid cluster center in this space, it must be re-normalized (projected back) to unit length.
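The update step can be sketched as follows. The helper names and the three example vectors are illustrative, and the code assumes the mean is not the zero vector (which would make the direction undefined).

```python
import math


def normalize(v):
    norm = math.hypot(*v)
    return [a / norm for a in v]


def spherical_centroid(points):
    """Arithmetic mean of unit vectors, then projected back onto the unit sphere."""
    dim = len(points[0])
    mean = [sum(p[i] for p in points) / len(points) for i in range(dim)]
    return mean, normalize(mean)


cluster = [normalize(v) for v in ([1.0, 0.0], [0.0, 1.0], [1.0, 1.0])]
mean, centroid = spherical_centroid(cluster)
print(math.hypot(*mean), math.hypot(*centroid))
```

The raw mean has norm below 1, so it sits inside the sphere; only the re-normalized vector is a valid centroid for cosine-based assignment.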
60A dataset contains features representing specific geolocational grids in a city, where travel is strictly limited to an orthogonal street network. If one attempts to use Euclidean distance to cluster optimal distribution hubs, what specific topological error is being introduced?
Distance & similarity metrics: Euclidean, Manhattan, Cosine similarity, when and why distance choice matters
Hard
A.The model ignores the triangle inequality, allowing points to map outside the permissible grid.
B.The model will overestimate distances because the $L_2$ norm is strictly greater than the $L_1$ norm.
C.The model fails to account for the curse of dimensionality, causing grid points to collapse into a singular origin.
D.The model assumes traversal along the hypotenuse, systematically underestimating the true traversal cost between coordinate points.
Correct Answer: The model assumes traversal along the hypotenuse, systematically underestimating the true traversal cost between coordinate points.
Explanation:
An orthogonal street grid strictly follows Manhattan distance ($L_1$ norm). Euclidean distance ($L_2$ norm) computes the straight-line 'as-the-crow-flies' path (the hypotenuse), which is physically impossible in an orthogonal grid network. Therefore, Euclidean distance systematically underestimates the actual traversal distance/cost.
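A two-point example shows the size of the gap. The coordinates are hypothetical grid positions measured in city blocks.

```python
import math


def manhattan(p, q):
    return sum(abs(a - b) for a, b in zip(p, q))


depot, customer = (0, 0), (3, 4)  # hypothetical grid coordinates, in blocks

straight_line = math.dist(depot, customer)  # Euclidean: cuts across blocks
street_route = manhattan(depot, customer)   # L1: follows the street grid
print(straight_line, street_route)
```

The two measures agree only when the points share a row or a column; otherwise the Euclidean figure is strictly smaller than the distance a vehicle must actually drive.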