1. What is the fundamental difference between the target variables in classification and regression problems?
A.Classification predicts discrete class labels, while regression predicts continuous numerical values.
B.Classification predicts continuous values, while regression predicts discrete categories.
C.Both predict continuous values, but regression uses a different loss function.
D.Classification requires unsupervised learning, while regression requires supervised learning.
Correct Answer: Classification predicts discrete class labels, while regression predicts continuous numerical values.
Explanation:
Regression models are used when the target variable is continuous (e.g., price, temperature), whereas classification models are used when the target variable is categorical or discrete (e.g., spam/not spam, dog/cat).
2. Which of the following scenarios is a regression problem?
A.Grouping customers into segments based on purchasing behavior.
B.Recognizing handwritten digits (0-9).
C.Predicting whether an email is spam or ham.
D.Predicting the price of a house based on its square footage.
Correct Answer: Predicting the price of a house based on its square footage.
Explanation:
House price is a continuous numerical variable, making this a regression task. The other options involve categorical classification or clustering.
3. In Simple Linear Regression, the relationship between the independent variable and the dependent variable is modeled as:
A.
B.
C.
D.
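Correct Answer: $y = \beta_0 + \beta_1 x + \epsilon$
Explanation:
Simple linear regression models the relationship as a straight line, $y = \beta_0 + \beta_1 x + \epsilon$, where $\beta_0$ is the intercept, $\beta_1$ is the slope, and $\epsilon$ is the error term.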
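As an illustrative sketch (not part of the original quiz), the intercept and slope of such a line can be estimated by ordinary least squares in closed form. The function name and the sample data are assumptions made for this example.

```python
def fit_simple_linear_regression(xs, ys):
    """Return (intercept, slope) of the least-squares line through the points."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope = covariance(x, y) / variance(x).
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    # The fitted line always passes through the point of means.
    intercept = mean_y - slope * mean_x
    return intercept, slope

# Hypothetical noise-free data generated from y = 1 + 2x.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]
intercept, slope = fit_simple_linear_regression(xs, ys)
```

With noise-free data the fit recovers the generating intercept and slope; with noisy data the error term absorbs the residuals.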
4. Which statement regarding Polynomial Regression is true?
A.It strictly requires non-parametric methods.
B.It cannot be solved using Ordinary Least Squares (OLS).
C.It is a form of linear regression because it is linear in the parameters (coefficients).
D.It is considered a non-linear regression because the curve is non-linear.
Correct Answer: It is a form of linear regression because it is linear in the parameters (coefficients).
Explanation:
Although polynomial regression fits a non-linear curve to the data, the model is linear with respect to its coefficients (weights), and thus is technically considered a form of linear regression.
5. What happens if the degree of the polynomial in polynomial regression is chosen to be too high?
A.The model will underfit the data (High Bias).
B.The computational cost decreases significantly.
C.The model will generalize better to unseen data.
D.The model will overfit the data (High Variance).
Correct Answer: The model will overfit the data (High Variance).
Explanation:
A high-degree polynomial tries to pass through every data point, capturing noise rather than the underlying trend, leading to overfitting and poor generalization.
6. Which loss function is most commonly used for Ordinary Least Squares (OLS) regression?
A.Cross-Entropy Loss
B.Hinge Loss
C.Kullback-Leibler Divergence
D.Mean Squared Error (MSE)
Correct Answer: Mean Squared Error (MSE)
Explanation:
OLS regression aims to minimize the sum of squared residuals, which is directly equivalent to minimizing the Mean Squared Error (MSE).
7. The Mean Squared Error (MSE) is calculated as:
A.
B.
C.
D.
Correct Answer: $\text{MSE} = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$
Explanation:
MSE is the average of the squares of the errors (the differences between the actual values $y_i$ and the predicted values $\hat{y}_i$).
8. Which loss function is more robust to outliers in a regression problem?
A.Root Mean Squared Error (RMSE)
B.Mean Squared Error (MSE)
C.Mean Absolute Error (MAE)
D.L2 Norm
Correct Answer: Mean Absolute Error (MAE)
Explanation:
MAE minimizes the absolute differences (L1 norm). Unlike MSE, which squares the errors and penalizes large errors (outliers) heavily, MAE treats all errors linearly, making it more robust to outliers.
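A small sketch can make the contrast concrete. The data below are made up for illustration: every prediction is off by 1, and then a single outlier with an error of 88 is appended.

```python
def mse(actual, predicted):
    """Mean of the squared errors."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)

def mae(actual, predicted):
    """Mean of the absolute errors."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

actual = [10.0, 12.0, 11.0, 13.0]
predicted = [11.0, 11.0, 12.0, 12.0]          # every error has magnitude 1

actual_with_outlier = actual + [100.0]
predicted_with_outlier = predicted + [12.0]   # one extra error of magnitude 88

clean = (mse(actual, predicted), mae(actual, predicted))
noisy = (mse(actual_with_outlier, predicted_with_outlier),
         mae(actual_with_outlier, predicted_with_outlier))
```

Both metrics equal 1 on the clean data. After the outlier is added, MSE grows by three orders of magnitude while MAE grows by roughly a factor of 18, which is why a model fitted by minimizing MSE is pulled much harder toward the outlier.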
9. In the context of regression regularization, Lasso Regression adds which penalty term to the loss function?
A.No penalty term
B.L1 penalty (Absolute magnitude of coefficients: $\lambda \sum_j |\beta_j|$)
C.A combination of L1 and L2 penalties
D.L2 penalty (Squared magnitude of coefficients: $\lambda \sum_j \beta_j^2$)
Correct Answer: L1 penalty (Absolute magnitude of coefficients: $\lambda \sum_j |\beta_j|$)
Explanation:
Lasso (Least Absolute Shrinkage and Selection Operator) uses the L1 norm penalty, which can shrink some coefficients to exactly zero, effectively performing feature selection.
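One way to see why the L1 penalty produces exact zeros is the special case of an orthonormal design matrix, which is an assumption made only for this sketch. There, each Lasso coefficient is the soft-thresholded OLS coefficient, whereas an L2 (Ridge) penalty merely rescales it. The function names and values are illustrative.

```python
def lasso_coefficient(ols_coef, lam):
    """Soft-thresholding: minimizer of 0.5*(b - ols_coef)**2 + lam*abs(b)."""
    if ols_coef > lam:
        return ols_coef - lam
    if ols_coef < -lam:
        return ols_coef + lam
    return 0.0   # small coefficients are set to exactly zero

def ridge_coefficient(ols_coef, lam):
    """Minimizer of 0.5*(b - ols_coef)**2 + 0.5*lam*b**2."""
    return ols_coef / (1.0 + lam)   # shrinks, but never reaches zero for finite lam

ols_coefs = [3.0, -0.4, 0.2]   # hypothetical OLS estimates
lam = 0.5
lasso = [lasso_coefficient(b, lam) for b in ols_coefs]
ridge = [ridge_coefficient(b, lam) for b in ols_coefs]
```

Here the two small coefficients are dropped by the L1 penalty while the L2 penalty keeps all three, only smaller.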
10. What is a defining characteristic of Non-Parametric Regression?
A.It assumes a fixed mathematical form (e.g., a line) with a finite set of parameters.
B.The number of parameters grows with the size of the training data.
C.It requires the data to be normally distributed.
D.It only works for classification problems.
Correct Answer: The number of parameters grows with the size of the training data.
Explanation:
Non-parametric methods (like KNN regression or Kernel regression) do not assume a fixed functional form. The 'parameters' are essentially the training data itself, so the complexity grows as data increases.
11. In K-Nearest Neighbors (KNN) regression, how is the prediction for a new data point made?
A.By taking the majority vote of the class labels of neighbors.
B.By solving a linear equation.
C.By taking the average (or weighted average) of the target values of the 'K' closest training neighbors.
D.By calculating the probability using Bayes' theorem.
Correct Answer: By taking the average (or weighted average) of the target values of the 'K' closest training neighbors.
Explanation:
For regression, KNN identifies the nearest neighbors and averages their continuous target values to predict the output for the query point.
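The averaging step can be sketched in a few lines. This is a minimal, unweighted version on a single feature; the function name and data are assumptions for illustration.

```python
def knn_regress(train_x, train_y, query, k):
    """Predict by averaging the targets of the k training points closest to query."""
    # Rank training indices by absolute distance to the query (1-D feature).
    order = sorted(range(len(train_x)), key=lambda i: abs(train_x[i] - query))
    nearest = order[:k]
    return sum(train_y[i] for i in nearest) / k

train_x = [1.0, 2.0, 3.0, 10.0]
train_y = [1.5, 2.5, 3.5, 20.0]
prediction = knn_regress(train_x, train_y, query=2.2, k=3)
```

The three nearest neighbors of 2.2 are the points at 2.0, 3.0, and 1.0, so the distant point at 10.0 has no influence on the prediction.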
12. Which of the following is true regarding the choice of 'k' in KNN regression?
A.A very large 'k' leads to overfitting (high variance).
B.A very small 'k' (e.g., k=1) leads to high variance (overfitting).
C.A very small 'k' (e.g., k=1) leads to high bias (underfitting).
D.The value of 'k' does not affect the model performance.
Correct Answer: A very small 'k' (e.g., k=1) leads to high variance (overfitting).
Explanation:
When $k$ is small, the model is very sensitive to local noise in the data (overfitting). As $k$ increases, the prediction becomes smoother (lower variance, potentially higher bias).
13. What is the primary difference between Supervised Learning (Classification/Regression) and Unsupervised Learning (Clustering)?
A.Supervised learning is faster than unsupervised learning.
B.Supervised learning groups data, while unsupervised learning predicts values.
C.Supervised learning requires labeled data (input-output pairs), while unsupervised learning uses unlabeled data.
Correct Answer: Supervised learning requires labeled data (input-output pairs), while unsupervised learning uses unlabeled data.
Explanation:
The key distinction is the presence of labels. Classification/Regression trains on known outcomes. Clustering finds hidden structures in data without pre-existing labels.
14. The Euclidean distance between two points $(x_1, y_1)$ and $(x_2, y_2)$ is given by:
A.
B.
C.
D.
Correct Answer: $d = \sqrt{(x_2 - x_1)^2 + (y_2 - y_1)^2}$
Explanation:
Euclidean distance is the straight-line distance between two points, derived from the Pythagorean theorem ($L_2$ norm).
15. Which distance measure corresponds to the $L_1$ norm and is calculated as the sum of absolute differences?
A.Chebyshev Distance
B.Cosine Distance
C.Euclidean Distance
D.Manhattan Distance
Correct Answer: Manhattan Distance
Explanation:
Manhattan distance (Taxicab geometry) measures distance along axes at right angles, calculated as $\sum_{i} |x_i - y_i|$.
16. Cosine Similarity is particularly useful for:
A.Time series forecasting.
B.Geometric clustering of low-dimensional data.
C.Calculating distance on a grid.
D.Measuring the similarity between text documents (represented as vectors) irrespective of magnitude.
Correct Answer: Measuring the similarity between text documents (represented as vectors) irrespective of magnitude.
Explanation:
Cosine similarity measures the cosine of the angle between two vectors. It focuses on orientation rather than magnitude, making it ideal for text analysis where document length might vary but topic composition is similar.
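To illustrate the magnitude-invariance, here is a small sketch with made-up term-count vectors: a short document, a document with the same proportions but ten times the length, and a document on a disjoint vocabulary.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

short_doc = [1, 2, 0]     # hypothetical term counts
long_doc = [10, 20, 0]    # same term proportions, ten times longer
other_doc = [0, 0, 5]     # shares no terms with the others

same_topic = cosine_similarity(short_doc, long_doc)
different_topic = cosine_similarity(short_doc, other_doc)
```

The first pair scores (up to rounding) 1 despite the length difference, and the second pair scores 0 because the vectors are orthogonal.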
17. The Minkowski distance is a generalization of both Euclidean and Manhattan distances defined as $D(x, y) = \left( \sum_{i=1}^{n} |x_i - y_i|^p \right)^{1/p}$. If $p = 1$, it becomes:
A.Euclidean Distance
B.Mahalanobis Distance
C.Manhattan Distance
D.Chebyshev Distance
Correct Answer: Manhattan Distance
Explanation:
When $p = 1$ in the Minkowski formula, it results in the sum of absolute differences, which is the Manhattan distance.
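A short sketch of the generalization: the Minkowski distance takes the p-th root of the sum of absolute coordinate differences raised to the power p, so one function covers both special cases. The function name and points are illustrative.

```python
def minkowski_distance(x, y, p):
    """Minkowski distance of order p (p >= 1) between two equal-length points."""
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1.0 / p)

a, b = (0.0, 0.0), (3.0, 4.0)
manhattan = minkowski_distance(a, b, p=1)   # |3| + |4|
euclidean = minkowski_distance(a, b, p=2)   # sqrt(3**2 + 4**2)
```

For the same pair of points, the order 1 distance is 7 (walking along the axes) and the order 2 distance is 5 (the straight line).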
18. Which of the following is a Partition-based clustering algorithm?
A.BIRCH
B.Agglomerative Clustering
C.DBSCAN
D.K-Means
Correct Answer: K-Means
Explanation:
K-Means is the classic partition-based method, where data is divided into non-overlapping groups (partitions) defined by centroids.
19. What is the objective function that the K-Means algorithm tries to minimize?
A.Within-Cluster Sum of Squares (WCSS)
B.The number of clusters
C.Between-Cluster Sum of Squares
D.Silhouette Coefficient
Correct Answer: Within-Cluster Sum of Squares (WCSS)
Explanation:
K-Means aims to minimize the variance within each cluster, calculated as the sum of squared distances between data points and their respective cluster centroids.
20. Which of the following is a step in the K-Means algorithm?
A.Merging the two closest clusters.
B.Selecting the 'k' nearest neighbors for voting.
C.Assigning points to the nearest cluster centroid.
D.Drawing a separating hyperplane.
Correct Answer: Assigning points to the nearest cluster centroid.
Explanation:
The standard Lloyd's algorithm for K-Means alternates between two steps: 1) Assignment (assign points to nearest centroid) and 2) Update (recalculate centroids based on assigned points).
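The two alternating steps can be sketched as follows. This is a minimal version on 1-D data with hand-picked starting centroids, both of which are simplifying assumptions rather than how a library would implement it.

```python
def kmeans_1d(points, centroids, iterations=10):
    """Lloyd's algorithm on 1-D data, starting from the given centroids."""
    centroids = list(centroids)
    for _ in range(iterations):
        # Assignment step: each point joins the cluster of its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda j: (p - centroids[j]) ** 2)
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its assigned points
        # (an empty cluster keeps its previous centroid).
        new_centroids = [
            sum(c) / len(c) if c else centroids[j]
            for j, c in enumerate(clusters)
        ]
        if new_centroids == centroids:   # converged: nothing moved
            break
        centroids = new_centroids
    return centroids, clusters

points = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]
centroids, clusters = kmeans_1d(points, centroids=[0.0, 5.0])
```

On this data the centroids move from (0, 5) to (1.5, 9) and then settle at (2, 11), at which point the assignments stop changing.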
21. A major limitation of the standard K-Means algorithm is:
A.It is computationally very expensive for small datasets.
B.It always finds the global optimum.
C.It works well with non-convex cluster shapes.
D.It requires the number of clusters to be specified in advance.
Correct Answer: It requires the number of clusters to be specified in advance.
Explanation:
K-Means cannot automatically determine the optimal number of clusters; the user must provide $k$ as a hyperparameter.
22. How does K-Medoids differ from K-Means?
A.K-Medoids uses actual data points as centers (medoids) and is more robust to outliers.
B.K-Medoids uses the mean of the points as the center.
C.K-Medoids uses Euclidean distance exclusively.
D.K-Medoids is faster than K-Means.
Correct Answer: K-Medoids uses actual data points as centers (medoids) and is more robust to outliers.
Explanation:
Because K-Medoids selects an actual data point as the center rather than averaging coordinates (which can be skewed by outliers), it is more robust to noise/outliers.
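A tiny sketch of the difference, using a made-up 1-D cluster with one outlier. The medoid is taken here as the data point with the smallest total distance to all points in the cluster.

```python
def mean_center(points):
    """Centroid as used by K-Means: the arithmetic mean."""
    return sum(points) / len(points)

def medoid(points):
    """Center as used by K-Medoids: the member minimizing total distance to the rest."""
    return min(points, key=lambda c: sum(abs(c - p) for p in points))

cluster = [1.0, 2.0, 3.0, 4.0, 100.0]   # 100.0 is an outlier
center_by_mean = mean_center(cluster)
center_by_medoid = medoid(cluster)
```

The mean is dragged to 22, far from every regular point, while the medoid stays at 3, an actual member in the middle of the cluster.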
23. Hierarchical clustering can be divided into two main types:
A.Centroid-based and Density-based
B.Agglomerative (Bottom-Up) and Divisive (Top-Down)
C.Supervised and Unsupervised
D.Linear and Non-linear
Correct Answer: Agglomerative (Bottom-Up) and Divisive (Top-Down)
Explanation:
Agglomerative starts with $n$ clusters (one per data point) and merges them. Divisive starts with 1 cluster and splits it recursively.
24. In Agglomerative Hierarchical Clustering, what does 'Single Linkage' measure?
A.The maximum distance between points in two clusters.
B.The distance between the centroids of two clusters.
C.The average distance between all pairs of points in two clusters.
D.The minimum distance between the closest pair of points in two clusters.
Correct Answer: The minimum distance between the closest pair of points in two clusters.
Explanation:
Single linkage looks at the shortest distance between any single point in cluster A and any single point in cluster B. It often leads to the 'chaining' phenomenon.
25. What is a Dendrogram?
A.A scatter plot of the clusters.
B.A diagram representing the tree structure of hierarchical clustering.
C.A method to calculate the derivative of a function.
D.A plot showing the loss function over iterations.
Correct Answer: A diagram representing the tree structure of hierarchical clustering.
Explanation:
A dendrogram is a tree diagram used to illustrate the arrangement of the clusters produced by hierarchical clustering.
26. In hierarchical clustering, 'Complete Linkage' uses which distance metric to merge clusters?
A.Maximum distance between points (farthest neighbors).
B.Average distance between points.
C.Distance between centroids.
D.Minimum distance between points (nearest neighbors).
Correct Answer: Maximum distance between points (farthest neighbors).
Explanation:
Complete linkage considers the distance between the two clusters to be equal to the longest distance from any member of one cluster to any member of the other cluster.
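The common linkage criteria differ only in how they summarize the pairwise distances between two clusters. A minimal sketch on 1-D clusters (names and data are illustrative):

```python
def pairwise_distances(cluster_a, cluster_b):
    """All distances between a point of cluster_a and a point of cluster_b."""
    return [abs(a - b) for a in cluster_a for b in cluster_b]

def single_linkage(cluster_a, cluster_b):
    return min(pairwise_distances(cluster_a, cluster_b))   # nearest pair

def complete_linkage(cluster_a, cluster_b):
    return max(pairwise_distances(cluster_a, cluster_b))   # farthest pair

def average_linkage(cluster_a, cluster_b):
    distances = pairwise_distances(cluster_a, cluster_b)
    return sum(distances) / len(distances)                  # mean over all pairs

cluster_a = [1.0, 2.0]
cluster_b = [4.0, 7.0]
linkages = (single_linkage(cluster_a, cluster_b),
            complete_linkage(cluster_a, cluster_b),
            average_linkage(cluster_a, cluster_b))
```

For these two clusters the pairwise distances are 3, 6, 2, and 5, giving 2 for single linkage, 6 for complete linkage, and 4 for average linkage.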
27. Which clustering method does NOT require specifying the number of clusters upfront?
A.K-Means
B.Gaussian Mixture Models
C.K-Medoids
D.Hierarchical Clustering
Correct Answer: Hierarchical Clustering
Explanation:
Hierarchical clustering builds a tree (dendrogram). The number of clusters is determined by where you choose to 'cut' the tree, which can be decided after viewing the structure.
28. What is the Elbow Method used for?
A.To prevent overfitting in regression.
B.To visualize high-dimensional data.
C.To determine the optimal number of clusters ($k$) in K-Means.
D.To calculate the distance between clusters.
Correct Answer: To determine the optimal number of clusters ($k$) in K-Means.
Explanation:
The Elbow Method plots the WCSS against the number of clusters. The optimal $k$ is usually at the 'elbow' point where the rate of decrease in WCSS slows down significantly.
29. The Silhouette Score ranges between:
A.-infinity and +infinity
B.-1 and 1
C.0 and 100
D.0 and 1
Correct Answer: -1 and 1
Explanation:
The Silhouette Coefficient measures how similar an object is to its own cluster compared to other clusters. +1 indicates good clustering, 0 indicates overlapping, and -1 indicates wrong assignment.
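For a single point the coefficient is (b - a) / max(a, b), where a is the mean distance to the other members of its own cluster and b is the smallest mean distance to any other cluster. A minimal 1-D sketch with made-up points:

```python
def mean_distance(point, cluster):
    return sum(abs(point - p) for p in cluster) / len(cluster)

def silhouette(point, own_others, other_clusters):
    """Silhouette of one point; own_others are the other members of its cluster."""
    a = mean_distance(point, own_others)                       # cohesion
    b = min(mean_distance(point, c) for c in other_clusters)   # separation
    return (b - a) / max(a, b)

# A point sitting close to its own cluster and far from the other one.
well_placed = silhouette(1.0, own_others=[2.0], other_clusters=[[10.0, 11.0]])
# A point assigned to the far cluster although it sits next to the other one.
misplaced = silhouette(9.0, own_others=[1.0, 2.0], other_clusters=[[10.0, 11.0]])
```

The well-placed point scores close to +1 and the misplaced point scores -0.8, matching the interpretation of the two ends of the range.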
30. A Silhouette Score close to +1 implies:
A.The clustering algorithm failed.
B.The point is well matched to its own cluster and far from neighboring clusters.
C.The point is assigned to the wrong cluster.
D.The point is on or very close to the decision boundary between two neighboring clusters.
Correct Answer: The point is well matched to its own cluster and far from neighboring clusters.
Explanation:
A high positive score indicates dense, well-separated clusters.
31. Which metric is used for cluster validation when ground truth labels are available?
A.Silhouette Score
B.Davies-Bouldin Index
C.Rand Index (or Adjusted Rand Index)
D.Elbow Method
Correct Answer: Rand Index (or Adjusted Rand Index)
Explanation:
The Rand Index compares the clustering results with the actual class labels (external validation) to measure similarity. The others are internal validation metrics used when labels are unknown.
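The (unadjusted) Rand Index can be sketched by counting point pairs: a pair counts as an agreement when both labelings put the two points together, or both put them apart. The label lists below are invented for illustration.

```python
from itertools import combinations

def rand_index(labels_true, labels_pred):
    """Fraction of point pairs on which the two labelings agree."""
    agreements = 0
    pairs = 0
    for i, j in combinations(range(len(labels_true)), 2):
        same_true = labels_true[i] == labels_true[j]
        same_pred = labels_pred[i] == labels_pred[j]
        agreements += (same_true == same_pred)
        pairs += 1
    return agreements / pairs

truth = [0, 0, 1, 1]
relabelled = [1, 1, 0, 0]    # same grouping, different cluster ids
one_mistake = [0, 0, 0, 1]   # third point placed in the wrong group

perfect = rand_index(truth, relabelled)
partial = rand_index(truth, one_mistake)
```

Because only co-membership matters, swapping the cluster ids still yields 1.0, while the single misassignment breaks three of the six pairs and yields 0.5.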
32. In the context of Ridge Regression, as the penalty parameter $\lambda$ approaches infinity, the regression coefficients tend towards:
A.The OLS estimates
B.1
C.Zero
D.Infinity
Correct Answer: Zero
Explanation:
A very high penalty dominates the loss function, forcing the optimization to minimize the coefficients' magnitude to near zero to reduce the penalty.
33. Which regression technique fits a local regression model to a subset of the data surrounding the query point?
Correct Answer: LOESS (or LOWESS)
Explanation:
LOESS (or LOWESS) is a non-parametric method that fits simple models (like linear regression) to localized subsets of data.
34. Jaccard Similarity is defined as:
A.
B.
C.
D.
Correct Answer: $J(A, B) = \frac{|A \cap B|}{|A \cup B|}$
Explanation:
Jaccard similarity measures the similarity between finite sample sets, defined as the size of the intersection divided by the size of the union.
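With Python sets the definition translates almost literally. The handling of two empty sets below is a convention chosen for this sketch, since the ratio is otherwise undefined.

```python
def jaccard_similarity(a, b):
    """Size of the intersection divided by the size of the union."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0   # convention chosen here: two empty sets are identical
    return len(a & b) / len(a | b)

overlap = jaccard_similarity({1, 2, 3}, {2, 3, 4})   # 2 shared out of 4 distinct
disjoint = jaccard_similarity({1, 2}, {3, 4})
```

The first pair shares two of four distinct elements, giving 0.5; disjoint sets give 0.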
35. K-Means++ is an algorithm used for:
A.Calculating the final centroids.
B.Initializing the cluster centers to improve convergence speed and quality.
C.Determining the value of K automatically.
D.Post-processing the clusters.
Correct Answer: Initializing the cluster centers to improve convergence speed and quality.
Explanation:
K-Means++ is a smart initialization technique that spreads out the initial centroids to avoid poor local minima, addressing the sensitivity of standard K-Means to initialization.
36. Which of the following data shapes is K-Means least likely to handle correctly?
A.Clusters with similar variances.
B.Compact, well-separated blobs.
C.Concentric circles (e.g., a donut shape).
D.Spherical clusters of equal size.
Correct Answer: Concentric circles (e.g., a donut shape).
Explanation:
K-Means assumes clusters are convex and spherical (using Euclidean distance). It fails on non-convex shapes like concentric rings or elongated moons.
37. The Dunn Index is an internal cluster validation metric where a higher value indicates:
A.Loose and overlapping clusters.
B.Compact and well-separated clusters.
C.Poor clustering performance.
D.High computational complexity.
Correct Answer: Compact and well-separated clusters.
Explanation:
The Dunn Index is the ratio of the minimum inter-cluster distance to the maximum intra-cluster diameter. We want the numerator high (separation) and denominator low (compactness), so a higher index is better.
38. Which statement regarding the bias-variance trade-off in regression is correct?
A.Complex non-linear models usually have low bias and high variance.
B.Variance refers to the error on the training set.
C.We want to maximize both bias and variance.
D.Simple linear models usually have low bias and high variance.
Correct Answer: Complex non-linear models usually have low bias and high variance.
Explanation:
Complex models fit the training data very well (low bias) but fluctuate wildly with different training sets (high variance), leading to overfitting.
39. What is Ward's Method in hierarchical clustering?
A.A divisive method that splits based on density.
B.A method equivalent to single linkage.
C.A method that uses random linkage.
D.An agglomerative method that minimizes the increase in total within-cluster variance when merging.
Correct Answer: An agglomerative method that minimizes the increase in total within-cluster variance when merging.
Explanation:
Ward's method is a variance-minimizing approach. At each step, it merges the pair of clusters that leads to the minimum increase in total within-cluster variance.
40. Hamming distance is primarily used for:
A.Geospatial coordinates.
B.Image pixel intensity.
C.Categorical data or strings of equal length.
D.Continuous numerical data.
Correct Answer: Categorical data or strings of equal length.
Explanation:
Hamming distance measures the number of positions at which corresponding symbols are different between two strings of equal length (e.g., '101' vs '111' distance is 1).
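A minimal sketch of the count, including the equal-length requirement (the function name is illustrative):

```python
def hamming_distance(s, t):
    """Number of positions at which two equal-length sequences differ."""
    if len(s) != len(t):
        raise ValueError("Hamming distance requires sequences of equal length")
    return sum(a != b for a, b in zip(s, t))

bits = hamming_distance("101", "111")            # differs only in the middle position
words = hamming_distance("karolin", "kathrin")   # differs in three positions
```

The same function works for any equal-length sequences of categorical symbols, such as lists of category labels.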
41. In kernel regression (e.g., Nadaraya-Watson), the 'bandwidth' parameter controls:
A.The smoothness of the fit (width of the kernel window).
B.The number of clusters.
C.The learning rate of the gradient descent.
D.The number of iterations.
Correct Answer: The smoothness of the fit (width of the kernel window).
Explanation:
The bandwidth determines how much influence neighbors have. A large bandwidth results in a smoother curve (high bias), while a small bandwidth results in a wiggly curve (high variance).
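The effect of the bandwidth can be seen in a small Nadaraya-Watson sketch. A Gaussian kernel and the toy data below are assumptions made for illustration; other kernels behave similarly.

```python
import math

def nadaraya_watson(train_x, train_y, query, bandwidth):
    """Gaussian-kernel weighted average of the training targets."""
    weights = [math.exp(-0.5 * ((query - x) / bandwidth) ** 2) for x in train_x]
    return sum(w * y for w, y in zip(weights, train_y)) / sum(weights)

train_x = [0.0, 1.0, 2.0, 3.0]
train_y = [0.0, 1.0, 0.0, 1.0]   # alternating targets

# Tiny bandwidth: only the training point at the query location matters.
wiggly = nadaraya_watson(train_x, train_y, query=1.0, bandwidth=0.1)
# Huge bandwidth: every training point gets almost the same weight.
smooth = nadaraya_watson(train_x, train_y, query=1.0, bandwidth=100.0)
```

With the small bandwidth the prediction reproduces the local target (about 1.0); with the large bandwidth it collapses toward the global mean (about 0.5).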
42. Which of the following is NOT a metric for calculating the distance between two clusters in hierarchical clustering?
A.Single Linkage
B.Complete Linkage
C.Gradient Descent
D.Average Linkage
Correct Answer: Gradient Descent
Explanation:
Gradient Descent is an optimization algorithm used to minimize loss functions, not a linkage criterion for measuring distance between clusters.
43. For a dataset with $n$ points, what is the time complexity of one iteration of K-Means with $k$ clusters and $d$ dimensions?
A.
B.
C.
D.
Correct Answer: $O(n \cdot k \cdot d)$
Explanation:
For every point ($n$), we calculate the distance to every centroid ($k$) across all dimensions ($d$). Hence the cost is linear in $n$.
44. What is the main advantage of Hierarchical Clustering over K-Means?
A.It is computationally faster for large datasets.
B.It scales linearly with the number of data points.
C.It handles missing values natively.
D.It provides a taxonomy/hierarchy of clusters and doesn't require pre-specifying $k$.
Correct Answer: It provides a taxonomy/hierarchy of clusters and doesn't require pre-specifying $k$.
Explanation:
The resulting dendrogram provides rich structural information and allows the user to choose the granularity of clustering after the fact.
45. If a regression model has an $R^2$ (Coefficient of Determination) score of 1.0, it means:
A.The model is underfitting.
B.The model perfectly fits the data.
C.The model is a constant line.
D.The model explains none of the variability of the response data.
Correct Answer: The model perfectly fits the data.
Explanation:
$R^2$ represents the proportion of variance in the dependent variable explained by the model. A score of 1.0 implies perfect prediction of the training data.
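The coefficient of determination can be computed as one minus the ratio of the residual sum of squares to the total sum of squares. A minimal sketch with made-up values:

```python
def r_squared(actual, predicted):
    """1 - SS_res / SS_tot; assumes the actual values are not all identical."""
    mean_y = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_y) ** 2 for a in actual)
    return 1.0 - ss_res / ss_tot

actual = [1.0, 2.0, 3.0, 4.0]
perfect = r_squared(actual, actual)        # predictions equal the targets
baseline = r_squared(actual, [2.5] * 4)    # always predicting the mean
```

Perfect predictions leave no residual, giving 1.0, while a constant prediction at the mean explains none of the variance and gives 0.0.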
46. Which of these is a 'lazy learning' algorithm often used for regression?
A.Linear Regression
B.K-Nearest Neighbors (KNN)
C.K-Means
D.Ridge Regression
Correct Answer: K-Nearest Neighbors (KNN)
Explanation:
KNN is a lazy learner because it does not learn a discriminative function from the training data but memorizes the training dataset instead. Computation happens only during prediction.
47. In the context of clustering, what is 'inter-cluster distance'?
A.The distance from a point to the origin.
B.The distance between different clusters.
C.The distance between points within the same cluster.
D.The sum of squared errors.
Correct Answer: The distance between different clusters.
Explanation:
Good clustering aims to maximize inter-cluster distance (separation) and minimize intra-cluster distance (compactness).
48. When using the Manhattan distance, the set of points at a constant distance from the origin forms a:
A.Sphere
B.Hyperbola
C.Square (rotated 45 degrees)
D.Circle
Correct Answer: Square (rotated 45 degrees)
Explanation:
In 2D, $|x| + |y| = c$ describes a square tilted by 45 degrees (a diamond shape).
49. Which statement regarding outlier sensitivity is correct?
A.K-Means is less sensitive to outliers than K-Medoids.
B.Median-based methods are more sensitive to outliers than Mean-based methods.
C.K-Means is sensitive to outliers because the mean is influenced by extreme values.
D.Least Squares Regression is robust to outliers.
Correct Answer: K-Means is sensitive to outliers because the mean is influenced by extreme values.
Explanation:
Since K-Means calculates the arithmetic mean of points to find the centroid, a distant outlier can significantly shift the centroid, distorting the cluster.
50. What is the 'Kernel Trick' in the context of non-linear regression (e.g., Support Vector Regression)?
A.Mapping data to a higher-dimensional space to make it linearly separable/fittable without explicitly calculating coordinates.
B.A method to reduce dimensionality.
C.Using a GPU kernel for faster processing.
D.Ignoring non-linear data points.
Correct Answer: Mapping data to a higher-dimensional space to make it linearly separable/fittable without explicitly calculating coordinates.
Explanation:
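The kernel trick computes inner products in a high-dimensional feature space directly from the original inputs via a kernel function, without ever constructing the mapped coordinates, allowing algorithms to fit non-linear relationships efficiently.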
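As a concrete sketch, take the degree-2 polynomial kernel on 2-D inputs (chosen here purely for illustration). Squaring the ordinary dot product gives the same number as mapping both inputs into a 3-D feature space and taking the dot product there.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def phi(x):
    """Explicit degree-2 feature map for a 2-D input."""
    return (x[0] ** 2, math.sqrt(2) * x[0] * x[1], x[1] ** 2)

def poly_kernel(x, z):
    """Same inner product, computed without building phi(x) or phi(z)."""
    return dot(x, z) ** 2

x, z = (1.0, 2.0), (3.0, 4.0)
explicit = dot(phi(x), phi(z))   # inner product in the mapped space
implicit = poly_kernel(x, z)     # kernel evaluated on the original inputs
```

Both routes give 121 (up to floating-point rounding), but the kernel route never materializes the higher-dimensional vectors, which is what keeps the approach tractable when the feature space is very large.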