1. What is the fundamental difference between the target variables in classification and regression problems?
A. Classification predicts continuous values, while regression predicts discrete categories.
B. Classification predicts discrete class labels, while regression predicts continuous numerical values.
C. Both predict continuous values, but regression uses a different loss function.
D. Classification requires unsupervised learning, while regression requires supervised learning.
Correct Answer: Classification predicts discrete class labels, while regression predicts continuous numerical values.
Explanation: Regression models are used when the target variable is continuous (e.g., price, temperature), whereas classification models are used when the target variable is categorical or discrete (e.g., spam/not spam, dog/cat).
2. Which of the following scenarios is a regression problem?
A. Predicting whether an email is spam or ham.
B. Predicting the price of a house based on its square footage.
C. Recognizing handwritten digits (0-9).
D. Grouping customers into segments based on purchasing behavior.
Correct Answer: Predicting the price of a house based on its square footage.
Explanation: House price is a continuous numerical variable, making this a regression task. The other options involve categorical classification or clustering.
3. In Simple Linear Regression, the relationship between the independent variable $x$ and the dependent variable $y$ is modeled as:
Correct Answer: $y = \beta_0 + \beta_1 x + \epsilon$
Explanation: Simple linear regression models the relationship as a straight line, where $\beta_0$ is the intercept, $\beta_1$ is the slope, and $\epsilon$ is the error term.
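To make the fitted line concrete, here is a minimal sketch estimating $\beta_0$ and $\beta_1$ by ordinary least squares with NumPy; the synthetic data, seed, and true coefficients are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, 50)
y = 2.0 + 3.0 * x + rng.normal(0, 1, 50)  # true intercept 2, slope 3, plus noise

# np.polyfit with deg=1 performs an OLS straight-line fit;
# it returns coefficients from highest degree down: [slope, intercept]
b1, b0 = np.polyfit(x, y, deg=1)
print(f"intercept ~ {b0:.2f}, slope ~ {b1:.2f}")  # close to 2 and 3
```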
4. Which statement regarding Polynomial Regression is true?
A. It is considered a non-linear regression because the curve is non-linear.
B. It is a form of linear regression because it is linear in the parameters (coefficients).
C. It cannot be solved using Ordinary Least Squares (OLS).
D. It strictly requires non-parametric methods.
Correct Answer: It is a form of linear regression because it is linear in the parameters (coefficients).
Explanation: Although polynomial regression fits a non-linear curve to the data, the model is linear with respect to its coefficients (weights), and thus is technically considered a form of linear regression.
5. What happens if the degree of the polynomial in polynomial regression is chosen to be too high?
A. The model will underfit the data (High Bias).
B. The model will generalize better to unseen data.
C. The model will overfit the data (High Variance).
D. The computational cost decreases significantly.
Correct Answer: The model will overfit the data (High Variance).
Explanation: A high-degree polynomial tries to pass through every data point, capturing noise rather than the underlying trend, leading to overfitting and poor generalization.
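One way to see this trade-off is to compare train and test error for a low-degree and a high-degree fit; the toy sine data and the degrees 3 and 12 below are illustrative assumptions, and the high-degree fit should typically show lower train error but higher test error.

```python
import numpy as np
from numpy.polynomial import Polynomial

rng = np.random.default_rng(1)
x = np.sort(rng.uniform(-3, 3, 40))
y = np.sin(x) + rng.normal(0, 0.2, 40)
x_tr, y_tr = x[::2], y[::2]    # 20 training points
x_te, y_te = x[1::2], y[1::2]  # 20 held-out points

for deg in (3, 12):
    p = Polynomial.fit(x_tr, y_tr, deg)  # least-squares polynomial fit
    mse_tr = np.mean((p(x_tr) - y_tr) ** 2)
    mse_te = np.mean((p(x_te) - y_te) ** 2)
    print(f"degree {deg}: train MSE {mse_tr:.3f}, test MSE {mse_te:.3f}")
```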
6. Which loss function is most commonly used for Ordinary Least Squares (OLS) regression?
A. Cross-Entropy Loss
B. Mean Squared Error (MSE)
C. Hinge Loss
D. Kullback-Leibler Divergence
Correct Answer: Mean Squared Error (MSE)
Explanation: OLS regression aims to minimize the sum of squared residuals, which is directly equivalent to minimizing the Mean Squared Error (MSE).
7. The Mean Squared Error (MSE) is calculated as:
Correct Answer: $\mathrm{MSE} = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$
Explanation: MSE is the average of the squares of the errors (the difference between actual values $y_i$ and predicted values $\hat{y}_i$).
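A worked instance of the formula, with made-up numbers:

```python
import numpy as np

y_true = np.array([3.0, 5.0, 7.5])  # actual values y_i
y_pred = np.array([2.5, 5.0, 8.0])  # predicted values y_hat_i

mse = np.mean((y_true - y_pred) ** 2)  # (0.25 + 0.0 + 0.25) / 3
print(mse)  # 0.1666...
```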
8. Which loss function is more robust to outliers in a regression problem?
A. Mean Squared Error (MSE)
B. Mean Absolute Error (MAE)
C. Root Mean Squared Error (RMSE)
D. L2 Norm
Correct Answer: Mean Absolute Error (MAE)
Explanation: MAE minimizes the absolute differences (L1 norm). Unlike MSE, which squares the errors and penalizes large errors (outliers) heavily, MAE treats all errors linearly, making it more robust to outliers.
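The contrast is easy to show numerically: squaring makes one large residual dominate MSE, while MAE grows only linearly. The residuals below are made up for illustration.

```python
import numpy as np

residuals = np.array([1.0, -1.0, 2.0, 100.0])  # the last value is an outlier
print("MSE:", np.mean(residuals ** 2))     # 2501.5, dominated by the outlier
print("MAE:", np.mean(np.abs(residuals)))  # 26.0, grows only linearly
```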
9. In the context of regression regularization, Lasso Regression adds which penalty term to the loss function?
A. L2 penalty (squared magnitude of coefficients: $\lambda \sum_j \beta_j^2$)
B. L1 penalty (absolute magnitude of coefficients: $\lambda \sum_j |\beta_j|$)
C. A combination of L1 and L2 penalties
D. No penalty term
Correct Answer: L1 penalty (absolute magnitude of coefficients: $\lambda \sum_j |\beta_j|$)
Explanation: Lasso (Least Absolute Shrinkage and Selection Operator) uses the L1 norm penalty, which can shrink some coefficients to exactly zero, effectively performing feature selection.
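A short sketch of the feature-selection effect, assuming scikit-learn is available; the synthetic data and the alpha value are arbitrary illustrative choices.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
# Only the first two features carry signal; the rest are pure noise
y = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(0, 0.1, 100)

model = Lasso(alpha=0.1).fit(X, y)
print(model.coef_)  # coefficients of the noise features are driven to exactly 0.0
```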
10. What is a defining characteristic of Non-Parametric Regression?
A. It assumes a fixed mathematical form (e.g., a line) with a finite set of parameters.
B. It requires the data to be normally distributed.
C. The number of parameters grows with the size of the training data.
D. It only works for classification problems.
Correct Answer: The number of parameters grows with the size of the training data.
Explanation: Non-parametric methods (like KNN regression or kernel regression) do not assume a fixed functional form. The 'parameters' are essentially the training data itself, so the complexity grows as data increases.
11. In K-Nearest Neighbors (KNN) regression, how is the prediction for a new data point made?
A. By solving a linear equation of the form $y = \beta_0 + \beta_1 x$.
B. By taking the average (or weighted average) of the target values of the 'K' closest training neighbors.
C. By taking the majority vote of the class labels of neighbors.
D. By calculating the probability using Bayes' theorem.
Correct Answer: By taking the average (or weighted average) of the target values of the 'K' closest training neighbors.
Explanation: For regression, KNN identifies the nearest neighbors and averages their continuous target values to predict the output for the query point.
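A bare-bones NumPy sketch of this averaging step, assuming Euclidean distance and an unweighted mean; the helper name and toy data are illustrative.

```python
import numpy as np

def knn_regress(X_train, y_train, x_query, k=3):
    """Predict by averaging the targets of the k nearest training points."""
    dists = np.linalg.norm(X_train - x_query, axis=1)  # Euclidean distances
    nearest = np.argsort(dists)[:k]                    # indices of the k closest
    return y_train[nearest].mean()

X = np.array([[1.0], [2.0], [3.0], [10.0]])
y = np.array([1.0, 2.0, 3.0, 10.0])
print(knn_regress(X, y, np.array([2.5]), k=2))  # averages 2.0 and 3.0 -> 2.5
```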
12. Which of the following is true regarding the choice of 'k' in KNN regression?
A. A very large 'k' leads to overfitting (high variance).
B. A very small 'k' (e.g., k=1) leads to high bias (underfitting).
C. A very small 'k' (e.g., k=1) leads to high variance (overfitting).
D. The value of 'k' does not affect the model performance.
Correct Answer: A very small 'k' (e.g., k=1) leads to high variance (overfitting).
Explanation: When $k$ is small, the model is very sensitive to local noise in the data (overfitting). As $k$ increases, the prediction becomes smoother (lower variance, potentially higher bias).
13. What is the primary difference between Supervised Learning (Classification/Regression) and Unsupervised Learning (Clustering)?
A. Supervised learning requires labeled data (input-output pairs), while unsupervised learning uses unlabeled data.
B. Supervised learning is faster than unsupervised learning.
C. Supervised learning groups data, while unsupervised learning predicts values.
Correct Answer: Supervised learning requires labeled data (input-output pairs), while unsupervised learning uses unlabeled data.
Explanation: The key distinction is the presence of labels. Classification/Regression trains on known outcomes. Clustering finds hidden structures in data without pre-existing labels.
14. The Euclidean distance between two points $p$ and $q$ is given by:
Correct Answer: $d(p, q) = \sqrt{\sum_{i=1}^{n} (p_i - q_i)^2}$
Explanation: Euclidean distance is the straight-line distance between two points, derived from the Pythagorean theorem ($L_2$ norm).
15. Which distance measure corresponds to the $L_1$ norm and is calculated as the sum of absolute differences?
A. Euclidean Distance
B. Manhattan Distance
C. Chebyshev Distance
D. Cosine Distance
Correct Answer: Manhattan Distance
Explanation: Manhattan distance (Taxicab geometry) measures distance along axes at right angles, calculated as $d(p, q) = \sum_{i=1}^{n} |p_i - q_i|$.
16. Cosine Similarity is particularly useful for:
A. Geometric clustering of low-dimensional data.
B. Measuring the similarity between text documents (represented as vectors) irrespective of magnitude.
C. Calculating distance on a grid.
D. Time series forecasting.
Correct Answer: Measuring the similarity between text documents (represented as vectors) irrespective of magnitude.
Explanation: Cosine similarity measures the cosine of the angle between two vectors. It focuses on orientation rather than magnitude, making it ideal for text analysis where document length might vary but topic composition is similar.
17. The Minkowski distance is a generalization of both Euclidean and Manhattan distances, defined as $d(p, q) = \left( \sum_{i=1}^{n} |p_i - q_i|^p \right)^{1/p}$. If $p = 1$, it becomes:
A. Euclidean Distance
B. Manhattan Distance
C. Chebyshev Distance
D. Mahalanobis Distance
Correct Answer: Manhattan Distance
Explanation: When $p = 1$ in the Minkowski formula, it reduces to the sum of absolute differences, which is the Manhattan distance.
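Because Minkowski distance generalizes the previous two metrics, one small function covers Questions 14, 15, and 17; this helper is an illustrative sketch.

```python
import numpy as np

def minkowski(u, v, p=2):
    """Minkowski distance: p=1 gives Manhattan, p=2 gives Euclidean."""
    return np.sum(np.abs(u - v) ** p) ** (1.0 / p)

a, b = np.array([0.0, 0.0]), np.array([3.0, 4.0])
print(minkowski(a, b, p=1))  # Manhattan: 7.0
print(minkowski(a, b, p=2))  # Euclidean: 5.0
```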
18. Which of the following is a partition-based clustering algorithm?
A. DBSCAN
B. Agglomerative Clustering
C. K-Means
D. BIRCH
Correct Answer: K-Means
Explanation: K-Means is the classic partition-based method, where data is divided into non-overlapping groups (partitions) defined by centroids.
19. What is the objective function that the K-Means algorithm tries to minimize?
A. Within-Cluster Sum of Squares (WCSS)
B. Between-Cluster Sum of Squares
C. Silhouette Coefficient
D. The number of clusters
Correct Answer: Within-Cluster Sum of Squares (WCSS)
Explanation: K-Means aims to minimize the variance within each cluster, calculated as the sum of squared distances between data points and their respective cluster centroids.
20. Which of the following is a step in the K-Means algorithm?
A. Merging the two closest clusters.
B. Assigning points to the nearest cluster centroid.
C. Drawing a separating hyperplane.
D. Selecting the 'k' nearest neighbors for voting.
Correct Answer: Assigning points to the nearest cluster centroid.
Explanation: The standard Lloyd's algorithm for K-Means alternates between two steps: 1) Assignment (assign points to the nearest centroid) and 2) Update (recalculate centroids based on assigned points).
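A bare-bones sketch of Lloyd's two alternating steps in NumPy (initialization is random; empty-cluster handling and convergence checks are omitted); the function name and toy data are illustrative.

```python
import numpy as np

def kmeans(X, k, n_iter=10, seed=0):
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), k, replace=False)]  # random initial centers
    for _ in range(n_iter):
        # Assignment step: each point goes to its nearest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = np.argmin(dists, axis=1)
        # Update step: recompute each centroid as the mean of its assigned points
        centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
    return centroids, labels

rng = np.random.default_rng(42)
X = np.vstack([rng.normal(0, 0.5, (50, 2)), rng.normal(5, 0.5, (50, 2))])
centers, _ = kmeans(X, k=2)
print(np.round(centers, 2))  # two centers near (0, 0) and (5, 5)
```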
21. A major limitation of the standard K-Means algorithm is:
A. It is computationally very expensive for small datasets.
B. It requires the number of clusters to be specified in advance.
C. It always finds the global optimum.
D. It works well with non-convex cluster shapes.
Correct Answer: It requires the number of clusters to be specified in advance.
Explanation: K-Means cannot automatically determine the optimal number of clusters; the user must provide $k$ as a hyperparameter.
22. How does K-Medoids differ from K-Means?
A. K-Medoids uses the mean of the points as the center.
B. K-Medoids uses actual data points as centers (medoids) and is more robust to outliers.
C. K-Medoids is faster than K-Means.
D. K-Medoids uses Euclidean distance exclusively.
Correct Answer: K-Medoids uses actual data points as centers (medoids) and is more robust to outliers.
Explanation: Because K-Medoids selects an actual data point as the center rather than averaging coordinates (which can be skewed by outliers), it is more robust to noise and outliers.
23. Hierarchical clustering can be divided into two main types:
A. Linear and Non-linear
B. Agglomerative (Bottom-Up) and Divisive (Top-Down)
C. Supervised and Unsupervised
D. Centroid-based and Density-based
Correct Answer: Agglomerative (Bottom-Up) and Divisive (Top-Down)
Explanation: Agglomerative starts with $n$ singleton clusters (one per point) and merges them. Divisive starts with 1 cluster and splits it recursively.
24. In Agglomerative Hierarchical Clustering, what does 'Single Linkage' measure?
A. The distance between the centroids of two clusters.
B. The maximum distance between points in two clusters.
C. The minimum distance between the closest pair of points in two clusters.
D. The average distance between all pairs of points in two clusters.
Correct Answer: The minimum distance between the closest pair of points in two clusters.
Explanation: Single linkage looks at the shortest distance between any single point in cluster A and any single point in cluster B. It often leads to the 'chaining' phenomenon.
25. What is a Dendrogram?
A. A diagram representing the tree structure of hierarchical clustering.
B. A plot showing the loss function over iterations.
C. A scatter plot of the clusters.
D. A method to calculate the derivative of a function.
Correct Answer: A diagram representing the tree structure of hierarchical clustering.
Explanation: A dendrogram is a tree diagram used to illustrate the arrangement of the clusters produced by hierarchical clustering.
26. In hierarchical clustering, 'Complete Linkage' uses which distance metric to merge clusters?
A. Distance between centroids.
B. Minimum distance between points (nearest neighbors).
C. Maximum distance between points (farthest neighbors).
D. Average distance between points.
Correct Answer: Maximum distance between points (farthest neighbors).
Explanation: Complete linkage considers the distance between the two clusters to be equal to the longest distance from any member of one cluster to any member of the other cluster.
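A short sketch using SciPy's hierarchical clustering, where the method argument selects the linkage criterion ('single' for nearest neighbors, 'complete' for farthest); the toy points are illustrative.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

X = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0]])
Z = linkage(X, method='complete')  # merge using farthest-neighbor distances
labels = fcluster(Z, t=2, criterion='maxclust')  # cut the dendrogram into 2 clusters
print(labels)  # e.g. [1 1 2 2]
```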
27. Which clustering method does NOT require specifying the number of clusters upfront?
A. K-Means
B. K-Medoids
C. Hierarchical Clustering
D. Gaussian Mixture Models
Correct Answer: Hierarchical Clustering
Explanation: Hierarchical clustering builds a tree (dendrogram). The number of clusters is determined by where you choose to 'cut' the tree, which can be decided after viewing the structure.
28. What is the Elbow Method used for?
A. To calculate the distance between clusters.
B. To determine the optimal number of clusters ($k$) in K-Means.
C. To prevent overfitting in regression.
D. To visualize high-dimensional data.
Correct Answer: To determine the optimal number of clusters ($k$) in K-Means.
Explanation: The Elbow Method plots the WCSS against the number of clusters. The optimal $k$ is usually at the 'elbow' point where the rate of decrease in WCSS slows down significantly.
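A minimal Elbow Method sketch, assuming scikit-learn is available; inertia_ is scikit-learn's name for the WCSS, and the blob data is illustrative.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=4, random_state=0)
for k in range(1, 8):
    wcss = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_
    print(k, round(wcss, 1))  # the drop in WCSS should level off near k = 4
```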
29. The Silhouette Score ranges between:
A. 0 and 1
B. -1 and 1
C. 0 and 100
D. -infinity and +infinity
Correct Answer: -1 and 1
Explanation: The Silhouette Coefficient measures how similar an object is to its own cluster compared to other clusters. +1 indicates good clustering, 0 indicates overlapping clusters, and -1 indicates a wrong assignment.
30. A Silhouette Score close to +1 implies:
A. The point is well matched to its own cluster and far from neighboring clusters.
B. The point is on or very close to the decision boundary between two neighboring clusters.
C. The point is assigned to the wrong cluster.
D. The clustering algorithm failed.
Correct Answer: The point is well matched to its own cluster and far from neighboring clusters.
Explanation: A high positive score indicates dense, well-separated clusters.
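A short sketch computing the score with scikit-learn on well-separated synthetic blobs, which should land close to +1; the data parameters are illustrative.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=300, centers=3, random_state=42)
labels = KMeans(n_clusters=3, n_init=10, random_state=42).fit_predict(X)
print(silhouette_score(X, labels))  # high positive score for well-separated blobs
```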
31. Which metric is used for cluster validation when ground truth labels are available?
A. Silhouette Score
B. Davies-Bouldin Index
C. Rand Index (or Adjusted Rand Index)
D. Elbow Method
Correct Answer: Rand Index (or Adjusted Rand Index)
Explanation: The Rand Index compares the clustering results with the actual class labels (external validation) to measure similarity. The others are internal validation metrics used when labels are unknown.
32. In the context of Ridge Regression, as the penalty parameter $\lambda$ approaches infinity, the regression coefficients tend towards:
A. Infinity
B. Zero
C. The OLS estimates
D. 1
Correct Answer: Zero
Explanation: A very high penalty $\lambda$ dominates the loss function, forcing the optimization to shrink the coefficients' magnitudes to near zero to reduce the penalty.
33. Which regression technique fits a local regression model to a subset of the data surrounding the query point?
Correct Answer: Locally Weighted Regression (LOESS/LOWESS)
Explanation: LOESS (or LOWESS) is a non-parametric method that fits simple models (like linear regression) to localized subsets of data.
34. Jaccard Similarity is defined as:
Correct Answer: $J(A, B) = \frac{|A \cap B|}{|A \cup B|}$
Explanation: Jaccard similarity measures the similarity between finite sample sets, defined as the size of the intersection divided by the size of the union.
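The set formula translates directly into a one-line function; the example sets are made up.

```python
def jaccard(a: set, b: set) -> float:
    """Size of the intersection divided by the size of the union."""
    return len(a & b) / len(a | b)

print(jaccard({"red", "green", "blue"}, {"green", "blue", "yellow"}))  # 2/4 = 0.5
```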
35. K-Means++ is an algorithm used for:
A. Calculating the final centroids.
B. Initializing the cluster centers to improve convergence speed and quality.
C. Post-processing the clusters.
D. Determining the value of K automatically.
Correct Answer: Initializing the cluster centers to improve convergence speed and quality.
Explanation: K-Means++ is a smart initialization technique that spreads out the initial centroids to avoid poor local minima, addressing the sensitivity of standard K-Means to initialization.
36. Which of the following data shapes is K-Means least likely to handle correctly?
A. Spherical clusters of equal size.
B. Compact, well-separated blobs.
C. Concentric circles (e.g., a donut shape).
D. Clusters with similar variances.
Correct Answer: Concentric circles (e.g., a donut shape).
Explanation: K-Means assumes clusters are convex and spherical (using Euclidean distance). It fails on non-convex shapes like concentric rings or elongated moons.
37. The Dunn Index is an internal cluster validation metric where a higher value indicates:
A. Compact and well-separated clusters.
B. Loose and overlapping clusters.
C. High computational complexity.
D. Poor clustering performance.
Correct Answer: Compact and well-separated clusters.
Explanation: The Dunn Index is the ratio of the minimum inter-cluster distance to the maximum intra-cluster diameter. We want the numerator high (separation) and the denominator low (compactness), so a higher index is better.
38. Which statement regarding the bias-variance trade-off in regression is correct?
A. Simple linear models usually have low bias and high variance.
B. Complex non-linear models usually have low bias and high variance.
C. We want to maximize both bias and variance.
D. Variance refers to the error on the training set.
Correct Answer: Complex non-linear models usually have low bias and high variance.
Explanation: Complex models fit the training data very well (low bias) but fluctuate wildly across different training sets (high variance), leading to overfitting.
39. What is Ward's Method in hierarchical clustering?
A. A divisive method that splits based on density.
B. An agglomerative method that minimizes the increase in total within-cluster variance when merging.
C. A method that uses random linkage.
D. A method equivalent to single linkage.
Correct Answer: An agglomerative method that minimizes the increase in total within-cluster variance when merging.
Explanation: Ward's method is a variance-minimizing approach. At each step, it merges the pair of clusters that leads to the minimum increase in total within-cluster variance.
40. Hamming distance is primarily used for:
A. Continuous numerical data.
B. Categorical data or strings of equal length.
C. Geospatial coordinates.
D. Image pixel intensity.
Correct Answer: Categorical data or strings of equal length.
Explanation: Hamming distance measures the number of positions at which corresponding symbols differ between two strings of equal length (e.g., the distance between '101' and '111' is 1).
41. In kernel regression (e.g., Nadaraya-Watson), the 'bandwidth' parameter controls:
A. The number of clusters.
B. The smoothness of the fit (width of the kernel window).
C. The learning rate of the gradient descent.
D. The number of iterations.
Correct Answer: The smoothness of the fit (width of the kernel window).
Explanation: The bandwidth determines how much influence neighbors have. A large bandwidth results in a smoother curve (high bias), while a small bandwidth results in a wiggly curve (high variance).
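A minimal Nadaraya-Watson sketch with a Gaussian kernel, showing exactly where the bandwidth enters; the function name, toy data, and bandwidth value are illustrative assumptions.

```python
import numpy as np

def nadaraya_watson(x_train, y_train, x_query, bandwidth=0.5):
    """Kernel-weighted average of the targets; bandwidth sets the window width."""
    w = np.exp(-0.5 * ((x_query - x_train) / bandwidth) ** 2)  # Gaussian weights
    return np.sum(w * y_train) / np.sum(w)

x = np.linspace(0, 6, 50)
y = np.sin(x) + np.random.default_rng(0).normal(0, 0.1, 50)
print(nadaraya_watson(x, y, x_query=3.0, bandwidth=0.3))  # close to sin(3.0) ≈ 0.14
```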
42. Which of the following is NOT a metric for calculating the distance between two clusters in hierarchical clustering?
A. Single Linkage
B. Complete Linkage
C. Average Linkage
D. Gradient Descent
Correct Answer: Gradient Descent
Explanation: Gradient Descent is an optimization algorithm used to minimize loss functions, not a linkage criterion for measuring distance between clusters.
43. For a dataset with $n$ points, what is the time complexity of one iteration of K-Means with $k$ clusters and $d$ dimensions?
Correct Answer: $O(n \cdot k \cdot d)$
Explanation: For every point ($n$), we calculate the distance to every centroid ($k$) across all dimensions ($d$). Hence the cost is linear in $n$, $k$, and $d$.
44. What is the main advantage of Hierarchical Clustering over K-Means?
A. It is computationally faster for large datasets.
B. It provides a taxonomy/hierarchy of clusters and doesn't require pre-specifying $k$.
C. It scales linearly with the number of data points.
D. It handles missing values natively.
Correct Answer: It provides a taxonomy/hierarchy of clusters and doesn't require pre-specifying $k$.
Explanation: The resulting dendrogram provides rich structural information and allows the user to choose the granularity of clustering after the fact.
45. If a regression model has an $R^2$ (Coefficient of Determination) score of 1.0, it means:
A. The model explains none of the variability of the response data.
B. The model perfectly fits the data.
C. The model is underfitting.
D. The model is a constant line.
Correct Answer: The model perfectly fits the data.
Explanation: $R^2$ represents the proportion of variance in the dependent variable explained by the model. A score of 1.0 implies perfect prediction of the training data.
46. Which of these is a 'lazy learning' algorithm often used for regression?
A. Linear Regression
B. K-Nearest Neighbors (KNN)
C. K-Means
D. Ridge Regression
Correct Answer: K-Nearest Neighbors (KNN)
Explanation: KNN is a lazy learner because it does not learn a discriminative function from the training data but memorizes the training dataset instead. Computation happens only during prediction.
47. In the context of clustering, what is 'inter-cluster distance'?
A. The distance between points within the same cluster.
B. The distance between different clusters.
C. The distance from a point to the origin.
D. The sum of squared errors.
Correct Answer: The distance between different clusters.
Explanation: Good clustering aims to maximize inter-cluster distance (separation) and minimize intra-cluster distance (compactness).
48. When using the Manhattan distance, the set of points at a constant distance from the origin forms a:
A. Circle
B. Square (rotated 45 degrees)
C. Sphere
D. Hyperbola
Correct Answer: Square (rotated 45 degrees)
Explanation: In 2D, the set $|x| + |y| = c$ describes a square tilted by 45 degrees (a diamond shape).
49. Which statement regarding outlier sensitivity is correct?
A. K-Means is less sensitive to outliers than K-Medoids.
B. Least Squares Regression is robust to outliers.
C. K-Means is sensitive to outliers because the mean is influenced by extreme values.
D. Median-based methods are more sensitive to outliers than mean-based methods.
Correct Answer: K-Means is sensitive to outliers because the mean is influenced by extreme values.
Explanation: Since K-Means calculates the arithmetic mean of points to find the centroid, a distant outlier can significantly shift the centroid, distorting the cluster.
50. What is the 'Kernel Trick' in the context of non-linear regression (e.g., Support Vector Regression)?
A. A method to reduce dimensionality.
B. Mapping data to a higher-dimensional space to make it linearly separable/fittable without explicitly calculating coordinates.
C. Using a GPU kernel for faster processing.
D. Ignoring non-linear data points.
Correct Answer: Mapping data to a higher-dimensional space to make it linearly separable/fittable without explicitly calculating coordinates.
Explanation: The kernel trick computes the inner products in a high-dimensional feature space, allowing algorithms to fit non-linear relationships efficiently.