1 $What is the fundamental difference between the target variables in classification and regression problems?$

A.

Classification predicts continuous values, while regression predicts discrete categories.

B.

Classification predicts discrete class labels, while regression predicts continuous numerical values.

C.

Both predict continuous values, but regression uses a different loss function.

D.

Classification requires unsupervised learning, while regression requires supervised learning.

2 $Which of the following scenarios is a regression problem?$

A.

Predicting whether an email is spam or ham.

B.

Predicting the price of a house based on its square footage.

C.

Recognizing handwritten digits (0-9).

D.

Grouping customers into segments based on purchasing behavior.

3 $In Simple Linear Regression, the relationship between the independent variable and the dependent variable is modeled as:$

A.

B.

C.

D.

4 $Which statement regarding Polynomial Regression is true?$

A.

It is considered a non-linear regression because the curve is non-linear.

B.

It is a form of linear regression because it is linear in the parameters (coefficients).

C.

It cannot be solved using Ordinary Least Squares (OLS).

D.

It strictly requires non-parametric methods.

5 $What happens if the degree of the polynomial in polynomial regression is chosen to be too high?$

A.

The model will underfit the data (High Bias).

B.

The model will generalize better to unseen data.

C.

The model will overfit the data (High Variance).

D.

The computational cost decreases significantly.

6 $Which loss function is most commonly used for Ordinary Least Squares (OLS) regression?$

A.

Cross-Entropy Loss

B.

Mean Squared Error (MSE)

C.

Hinge Loss

D.

Kullback-Leibler Divergence

7 $The Mean Squared Error (MSE) is calculated as:$

A.

B.

C.

D.

8 $Which loss function is more robust to outliers in a regression problem?$

A.

Mean Squared Error (MSE)

B.

Mean Absolute Error (MAE)

C.

Root Mean Squared Error (RMSE)

D.

L2 Norm
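
To make the outlier question above concrete, here is a minimal sketch that compares how MSE and MAE react when a single prediction is far off. The function names and the toy numbers are illustrative assumptions, not part of the quiz.

```python
def mse(y_true, y_pred):
    """Mean of squared residuals."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def mae(y_true, y_pred):
    """Mean of absolute residuals."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


# Hypothetical toy data: every residual is 0.5 in the clean case,
# and one residual jumps to 10 in the outlier case.
y_true = [1.0, 2.0, 3.0, 4.0]
y_pred_clean = [1.5, 2.5, 3.5, 4.5]
y_pred_outlier = [1.5, 2.5, 3.5, 14.0]

mse_growth = mse(y_true, y_pred_outlier) / mse(y_true, y_pred_clean)
mae_growth = mae(y_true, y_pred_outlier) / mae(y_true, y_pred_clean)
print(mse_growth, mae_growth)
```

Because the residual is squared, the single bad prediction inflates MSE far more than MAE in this example.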

9 $In the context of regression regularization, Lasso Regression adds which penalty term to the loss function?$

A.

L2 penalty (Squared magnitude of coefficients: $\lambda \sum_j \beta_j^2$)

B.

L1 penalty (Absolute magnitude of coefficients: $\lambda \sum_j |\beta_j|$)

C.

A combination of L1 and L2 penalties

D.

No penalty term

10 $What is a defining characteristic of Non-Parametric Regression?$

A.

It assumes a fixed mathematical form (e.g., a line) with a finite set of parameters.

B.

It requires the data to be normally distributed.

C.

The number of parameters grows with the size of the training data.

D.

It only works for classification problems.

11 $In K-Nearest Neighbors (KNN) regression, how is the prediction for a new data point made?$

A.

By solving a linear equation.

B.

By taking the average (or weighted average) of the target values of the 'K' closest training neighbors.

C.

By taking the majority vote of the class labels of neighbors.

D.

By calculating the probability using Bayes' theorem.
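
The neighbor-averaging idea behind KNN regression can be sketched in a few lines. This is a simplified illustration that assumes one-dimensional features and unweighted averaging; `knn_regress` and the toy data are hypothetical names chosen for the example.

```python
def knn_regress(train_x, train_y, query, k):
    """Predict the target for `query` as the mean target of its k nearest
    training points (1-D features, absolute difference as the distance)."""
    # Indices of training points, ordered from closest to farthest.
    order = sorted(range(len(train_x)), key=lambda i: abs(train_x[i] - query))
    nearest = order[:k]
    return sum(train_y[i] for i in nearest) / k


# Hypothetical toy data: the target equals the feature.
train_x = [1.0, 2.0, 3.0, 10.0]
train_y = [1.0, 2.0, 3.0, 10.0]

print(knn_regress(train_x, train_y, 2.2, k=3))  # averages the targets at x = 2, 3, 1
print(knn_regress(train_x, train_y, 2.2, k=4))  # the far point x = 10 now pulls the mean up
```

No model is fitted ahead of time: all the work happens at query time, which is why KNN is often described as a lazy learner.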

12 $Which of the following is true regarding the choice of 'k' in KNN regression?$

A.

A very large 'k' leads to overfitting (high variance).

B.

A very small 'k' (e.g., k=1) leads to high bias (underfitting).

C.

A very small 'k' (e.g., k=1) leads to high variance (overfitting).

D.

The value of 'k' does not affect the model performance.

13 $What is the primary difference between Supervised Learning (Classification/Regression) and Unsupervised Learning (Clustering)?$

A.

Supervised learning requires labeled data (input-output pairs), while unsupervised learning uses unlabeled data.

B.

Supervised learning is faster than unsupervised learning.

C.

Unsupervised learning always yields better accuracy.

D.

Supervised learning groups data, while unsupervised learning predicts values.

14 $The Euclidean distance between two points is given by:$

A.

B.

C.

D.

15 $Which distance measure corresponds to the L1 norm and is calculated as the sum of absolute differences?$

A.

Euclidean Distance

B.

Manhattan Distance

C.

Chebyshev Distance

D.

Cosine Distance

16 $Cosine Similarity is particularly useful for:$

A.

Geometric clustering of low-dimensional data.

B.

Measuring the similarity between text documents (represented as vectors) irrespective of magnitude.

C.

Calculating distance on a grid.

D.

Time series forecasting.
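
As a rough illustration of why cosine similarity ignores magnitude, the sketch below compares a vector with a scaled copy of itself. The function name and the sample vectors are assumptions made for this example; it also assumes neither vector is all zeros.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between vectors a and b (both assumed non-zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical term-count vectors: doc_long is doc_short repeated twice,
# so the direction is the same and only the length differs.
doc_short = [1, 2, 3]
doc_long = [2, 4, 6]
unrelated = [3, 0, -1]  # orthogonal to doc_short: 1*3 + 2*0 + 3*(-1) = 0

print(cosine_similarity(doc_short, doc_long))
print(cosine_similarity(doc_short, unrelated))
```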

17 $The Minkowski distance is a generalization of both Euclidean and Manhattan distances defined as . If, it becomes:$

A.

Euclidean Distance

B.

Manhattan Distance

C.

Chebyshev Distance

D.

Mahalanobis Distance
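
The way one formula covers several distance measures can be sketched as follows. The exponent is called `p` here, which is an assumption because the formula in the question did not survive extraction; the sample points are also illustrative.

```python
def minkowski(a, b, p):
    """Minkowski distance of order p between equal-length points a and b."""
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1.0 / p)


def chebyshev(a, b):
    """Largest coordinate difference, the limit of Minkowski as p grows."""
    return max(abs(x - y) for x, y in zip(a, b))


origin = (0.0, 0.0)
point = (3.0, 4.0)

print(minkowski(origin, point, 1))  # sum of absolute differences
print(minkowski(origin, point, 2))  # straight-line distance
print(chebyshev(origin, point))
```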

18 $Which of the following is a Partition-based clustering algorithm?$

A.

DBSCAN

B.

Agglomerative Clustering

C.

K-Means

D.

BIRCH

19 $What is the objective function that the K-Means algorithm tries to minimize?$

A.

Within-Cluster Sum of Squares (WCSS)

B.

Between-Cluster Sum of Squares

C.

Silhouette Coefficient

D.

The number of clusters

20 $Which of the following is a step in the K-Means algorithm?$

A.

Merging the two closest clusters.

B.

Assigning points to the nearest cluster centroid.

C.

Drawing a separating hyperplane.

D.

Selecting the 'k' nearest neighbors for voting.
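
One iteration of K-Means alternates between two steps, which can be sketched roughly as below. This is a simplified illustration using squared Euclidean distance; `assign`, `update`, and the toy points are hypothetical names, and keeping an empty cluster's old centroid is just one possible convention.

```python
def assign(points, centroids):
    """Assignment step: label each point with the index of its nearest centroid."""
    labels = []
    for p in points:
        dists = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centroids]
        labels.append(dists.index(min(dists)))
    return labels


def update(points, labels, centroids):
    """Update step: move each centroid to the mean of its assigned points."""
    new_centroids = []
    for j, old in enumerate(centroids):
        members = [p for p, lab in zip(points, labels) if lab == j]
        if not members:
            new_centroids.append(old)  # assumed convention for empty clusters
            continue
        new_centroids.append(tuple(sum(c) / len(members) for c in zip(*members)))
    return new_centroids


points = [(0.0, 0.0), (0.0, 2.0), (10.0, 10.0), (10.0, 12.0)]
centroids = [(0.0, 0.0), (10.0, 10.0)]

labels = assign(points, centroids)
centroids = update(points, labels, centroids)
print(labels, centroids)
```

Repeating the two steps until the labels stop changing gives the usual algorithm; each iteration cannot increase the within-cluster sum of squares.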

21 $A major limitation of the standard K-Means algorithm is:$

A.

It is computationally very expensive for small datasets.

B.

It requires the number of clusters to be specified in advance.

C.

It always finds the global optimum.

D.

It works well with non-convex cluster shapes.

22 $How does K-Medoids differ from K-Means?$

A.

K-Medoids uses the mean of the points as the center.

B.

K-Medoids uses actual data points as centers (medoids) and is more robust to outliers.

C.

K-Medoids is faster than K-Means.

D.

K-Medoids uses Euclidean distance exclusively.

23 $Hierarchical clustering can be divided into two main types:$

A.

Linear and Non-linear

B.

Agglomerative (Bottom-Up) and Divisive (Top-Down)

C.

Supervised and Unsupervised

D.

Centroid-based and Density-based

24 $In Agglomerative Hierarchical Clustering, what does 'Single Linkage' measure?$

A.

The distance between the centroids of two clusters.

B.

The maximum distance between points in two clusters.

C.

The minimum distance between the closest pair of points in two clusters.

D.

The average distance between all pairs of points in two clusters.

25 $What is a Dendrogram?$

A.

A diagram representing the tree structure of hierarchical clustering.

B.

A plot showing the loss function over iterations.

C.

A scatter plot of the clusters.

D.

A method to calculate the derivative of a function.

26 $In hierarchical clustering, 'Complete Linkage' uses which distance metric to merge clusters?$

A.

Distance between centroids.

B.

Minimum distance between points (nearest neighbors).

C.

Maximum distance between points (farthest neighbors).

D.

Average distance between points.
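
The linkage criteria from the questions above differ only in how they summarize the pairwise distances between two clusters. A minimal sketch, assuming Euclidean distance and Python 3.8+ for `math.dist`; the cluster contents are made up for illustration.

```python
import math


def pairwise_distances(cluster_a, cluster_b):
    """All Euclidean distances between a point in cluster_a and a point in cluster_b."""
    return [math.dist(p, q) for p in cluster_a for q in cluster_b]


def single_linkage(cluster_a, cluster_b):
    return min(pairwise_distances(cluster_a, cluster_b))  # closest pair


def complete_linkage(cluster_a, cluster_b):
    return max(pairwise_distances(cluster_a, cluster_b))  # farthest pair


def average_linkage(cluster_a, cluster_b):
    d = pairwise_distances(cluster_a, cluster_b)
    return sum(d) / len(d)  # mean over all pairs


left = [(0.0, 0.0), (1.0, 0.0)]
right = [(3.0, 0.0), (6.0, 0.0)]

print(single_linkage(left, right), complete_linkage(left, right), average_linkage(left, right))
```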

27 $Which clustering method does NOT require specifying the number of clusters upfront?$

A.

K-Means

B.

K-Medoids

C.

Hierarchical Clustering

D.

Gaussian Mixture Models

28 $What is the Elbow Method used for?$

A.

To calculate the distance between clusters.

B.

To determine the optimal number of clusters (k) in K-Means.

C.

To prevent overfitting in regression.

D.

To visualize high-dimensional data.

29 $The Silhouette Score ranges between:$

A.

0 and 1

B.

-1 and 1

C.

0 and 100

D.

-infinity and +infinity

30 $A Silhouette Score close to +1 implies:$

A.

The point is well matched to its own cluster and far from neighboring clusters.

B.

The point is on or very close to the decision boundary between two neighboring clusters.

C.

The point is assigned to the wrong cluster.

D.

The clustering algorithm failed.
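
The silhouette value of a single point is commonly defined as (b - a) / max(a, b), where a is the mean distance to the other points in its own cluster and b is the smallest mean distance to any other cluster. The sketch below assumes one-dimensional data and that `own_cluster` excludes the point itself; all names and numbers are illustrative.

```python
def silhouette(point, own_cluster, other_clusters):
    """Silhouette value of one 1-D point.

    own_cluster: the other members of the point's cluster (non-empty).
    other_clusters: list of the remaining clusters (each non-empty).
    """
    a = sum(abs(point - q) for q in own_cluster) / len(own_cluster)
    b = min(sum(abs(point - q) for q in c) / len(c) for c in other_clusters)
    return (b - a) / max(a, b)


# A point sitting comfortably inside its own cluster.
well_placed = silhouette(0.0, own_cluster=[1.0], other_clusters=[[10.0, 11.0]])

# The point 10.0 grouped with {0, 1} even though it lies next to 11.
misplaced = silhouette(10.0, own_cluster=[0.0, 1.0], other_clusters=[[11.0]])

print(well_placed, misplaced)
```

The first value lands close to +1 and the second is clearly negative, matching the interpretation the two silhouette questions ask about.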

31 $Which metric is used for cluster validation when ground truth labels are available?$

A.

Silhouette Score

B.

Davies-Bouldin Index

C.

Rand Index (or Adjusted Rand Index)

D.

Elbow Method

32 $In the context of Ridge Regression, as the penalty parameter approaches infinity, the regression coefficients tend towards:$

A.

Infinity

B.

Zero

C.

The OLS estimates

D.

1
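
The effect of the ridge penalty can be seen in the simplest possible case. The sketch below assumes a single feature and no intercept, where minimizing the sum of (y - beta*x)^2 plus lam*beta^2 has the closed form beta = sum(x*y) / (sum(x^2) + lam). The name `ridge_slope`, the symbol `lam`, and the data are assumptions for illustration.

```python
def ridge_slope(x, y, lam):
    """Closed-form ridge coefficient for one feature and no intercept."""
    sxy = sum(a * b for a, b in zip(x, y))
    sxx = sum(a * a for a in x)
    return sxy / (sxx + lam)


# Hypothetical data generated by y = 2x, so the unpenalized slope is 2.
x = [1.0, 2.0, 3.0]
y = [2.0, 4.0, 6.0]

for lam in (0.0, 14.0, 1e3, 1e9):
    print(lam, ridge_slope(x, y, lam))
```

With `lam = 0` the formula reduces to ordinary least squares, and the coefficient shrinks steadily as `lam` grows.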

33 $Which regression technique fits a local regression model to a subset of the data surrounding the query point?$

A.

Linear Regression

B.

LOESS (Locally Estimated Scatterplot Smoothing)

C.

Ridge Regression

D.

Logistic Regression

34 $Jaccard Similarity is defined as:$

A.

B.

C.

D.
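
The answer options for this question did not survive extraction. For reference, Jaccard similarity between two sets is conventionally the size of their intersection divided by the size of their union, which can be sketched as below. Returning 1.0 for two empty sets is an assumed convention, and the sample sets are illustrative.

```python
def jaccard(a, b):
    """Jaccard similarity |A intersect B| / |A union B| of two collections."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0  # assumed convention: two empty sets count as identical
    return len(a & b) / len(a | b)


print(jaccard({1, 2, 3}, {2, 3, 4}))  # 2 shared items out of 4 distinct items
print(jaccard({"a", "b"}, {"c"}))     # nothing shared
```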

35 $K-Means++ is an algorithm used for:$

A.

Calculating the final centroids.

B.

Initializing the cluster centers to improve convergence speed and quality.

C.

Post-processing the clusters.

D.

Determining the value of K automatically.

36 $Which of the following data shapes is K-Means least likely to handle correctly?$

A.

Spherical clusters of equal size.

B.

Compact, well-separated blobs.

C.

Concentric circles (e.g., a donut shape).

D.

Clusters with similar variances.

37 $The Dunn Index is an internal cluster validation metric where a higher value indicates:$

A.

Compact and well-separated clusters.

B.

Loose and overlapping clusters.

C.

High computational complexity.

D.

Poor clustering performance.

38 $Which statement regarding the bias-variance trade-off in regression is correct?$

A.

Simple linear models usually have low bias and high variance.

B.

Complex non-linear models usually have low bias and high variance.

C.

We want to maximize both bias and variance.

D.

Variance refers to the error on the training set.

39 $What is Ward's Method in hierarchical clustering?$

A.

A divisive method that splits based on density.

B.

An agglomerative method that minimizes the increase in total within-cluster variance when merging.

C.

A method that uses random linkage.

D.

A method equivalent to single linkage.

40 $Hamming distance is primarily used for:$

A.

Continuous numerical data.

B.

Categorical data or strings of equal length.

C.

Geospatial coordinates.

D.

Image pixel intensity.
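
Hamming distance simply counts the positions at which two equal-length sequences differ. A minimal sketch, with the function name and inputs chosen for illustration:

```python
def hamming(a, b):
    """Number of positions where equal-length sequences a and b differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs sequences of equal length")
    return sum(x != y for x, y in zip(a, b))


print(hamming("karolin", "kathrin"))
print(hamming([1, 0, 1, 1], [1, 1, 1, 0]))
```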

41 $In kernel regression (e.g., Nadaraya-Watson), the 'bandwidth' parameter controls:$

A.

The number of clusters.

B.

The smoothness of the fit (width of the kernel window).

C.

The learning rate of the gradient descent.

D.

The number of iterations.

42 $Which of the following is NOT a metric for calculating the distance between two clusters in hierarchical clustering?$

A.

Single Linkage

B.

Complete Linkage

C.

Average Linkage

D.

Gradient Descent

43 $For a dataset with n points, what is the time complexity of one iteration of K-Means with k clusters and d dimensions?$

A.

B.

C.

D.

44 $What is the main advantage of Hierarchical Clustering over K-Means?$

A.

It is computationally faster for large datasets.

B.

It provides a taxonomy/hierarchy of clusters and doesn't require pre-specifying k.

C.

It scales linearly with the number of data points.

D.

It handles missing values natively.

45 $If a regression model has an R^2 (Coefficient of Determination) score of 1.0, it means:$

A.

The model explains none of the variability of the response data.

B.

The model perfectly fits the data.

C.

The model is underfitting.

D.

The model is a constant line.
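
The coefficient of determination is usually computed as 1 minus the ratio of the residual sum of squares to the total sum of squares. The sketch below follows that definition; `r_squared` and the toy values are illustrative, and it assumes the targets are not all identical (otherwise the denominator is zero).

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


y_true = [1.0, 2.0, 3.0, 4.0]

perfect = r_squared(y_true, [1.0, 2.0, 3.0, 4.0])      # predictions match exactly
mean_only = r_squared(y_true, [2.5, 2.5, 2.5, 2.5])    # always predicts the mean
print(perfect, mean_only)
```

A score of 1.0 means the residuals are all zero, while a model that always predicts the mean of the targets scores 0.0.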

46 $Which of these is a 'lazy learning' algorithm often used for regression?$

A.

Linear Regression

B.

K-Nearest Neighbors (KNN)

C.

K-Means

D.

Ridge Regression

47 $In the context of clustering, what is 'inter-cluster distance'?$

A.

The distance between points within the same cluster.

B.

The distance between different clusters.

C.

The distance from a point to the origin.

D.

The sum of squared errors.

48 $When using the Manhattan distance, the set of points at a constant distance from the origin forms a:$

A.

Circle

B.

Square (rotated 45 degrees)

C.

Sphere

D.

Hyperbola

49 $Which statement regarding outlier sensitivity is correct?$

A.

K-Means is less sensitive to outliers than K-Medoids.

B.

Least Squares Regression is robust to outliers.

C.

K-Means is sensitive to outliers because the mean is influenced by extreme values.

D.

Median-based methods are more sensitive to outliers than Mean-based methods.

50 $What is the 'Kernel Trick' in the context of non-linear regression (e.g., Support Vector Regression)?$

A.

A method to reduce dimensionality.

B.

Mapping data to a higher-dimensional space to make it linearly separable/fittable without explicitly calculating coordinates.

C.

Using a GPU kernel for faster processing.

D.

Ignoring non-linear data points.

Unit 4 - Practice Quiz