Unit 6 - Practice Quiz

CSE273 50 Questions

1 According to Tom Mitchell's definition, a computer program is said to learn from experience E with respect to some class of tasks T and performance measure P if:

A. Its performance at tasks in T, as measured by P, improves with experience E.
B. It can memorize the experience perfectly without error.
C. Its performance at tasks in T remains constant regardless of E.
D. It requires no prior knowledge to solve tasks in T.

2 In the context of a 'Checkers Learning Problem', what represents the Task (T)?

A. The percent of games won against opponents.
B. Playing checkers games.
C. Playing practice games against itself.
D. The rules of the game.

3 Which component of a learning system represents the set of all possible functions that the learning algorithm can select as the learned function?

A. The training set
B. The target function
C. The hypothesis space
D. The feature vector

4 In the Statistical Learning Framework, the data generation process assumes that data pairs are generated independently and identically distributed (i.i.d) according to:

A. A known Gaussian distribution.
B. A fixed but unknown probability distribution D.
C. A uniform distribution over integers.
D. The user's manual input.

5 What is the primary goal of the Empirical Risk Minimization (ERM) principle?

A. To minimize the error on the unseen test data directly.
B. To minimize the average loss on the observed training data.
C. To maximize the size of the hypothesis space.
D. To minimize the computational time of the algorithm.
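The ERM principle from question 5 can be sketched in a few lines. This is a toy illustration, not a standard library routine: the dataset, the threshold hypotheses, and the helper names are all invented for the example. ERM simply picks the hypothesis with the lowest average zero-one loss on the training set.

```python
# Sketch of Empirical Risk Minimization (ERM) on a toy 1-D dataset.
# Hypothesis space: a hypothetical set of threshold classifiers h(x) = [x > t].

def zero_one_loss(y_pred, y_true):
    # 1 for a mistake, 0 for a correct prediction.
    return 0.0 if y_pred == y_true else 1.0

def empirical_risk(h, data):
    # Average loss over the observed training examples.
    return sum(zero_one_loss(h(x), y) for x, y in data) / len(data)

def erm(hypotheses, data):
    # ERM: return the hypothesis minimizing the empirical risk.
    return min(hypotheses, key=lambda h: empirical_risk(h, data))

data = [(0.1, 0), (0.4, 0), (0.6, 1), (0.9, 1)]
hypotheses = [lambda x, t=t: int(x > t) for t in [0.2, 0.5, 0.8]]

best = erm(hypotheses, data)
print(empirical_risk(best, data))  # 0.0 — the threshold-0.5 classifier fits perfectly
```

Note that minimizing the training loss is all ERM does; whether the chosen hypothesis also has low true risk is exactly what the generalization bounds in the later questions address.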

6 The True Risk (or Generalization Error) is defined as:

A. The average error on the training set.
B. The expectation of the loss function over the true distribution D.
C. The difference between training error and validation error.
D. The square root of the bias.

7 Given a loss function ℓ, the Empirical Risk for a dataset S = {(x_1, y_1), ..., (x_m, y_m)} is given by:

A. Σ_{i=1}^{m} ℓ(h(x_i), y_i)
B. (1/m) Σ_{i=1}^{m} ℓ(h(x_i), y_i)
C. max_i ℓ(h(x_i), y_i)
D. E_{(x,y)~D}[ℓ(h(x), y)]

8 In PAC Learning, what does the parameter ε (epsilon) represent?

A. The probability that the hypothesis is incorrect.
B. The maximum allowed error (accuracy parameter).
C. The size of the training dataset.
D. The complexity of the hypothesis space.

9 In PAC Learning, what does the parameter δ (delta) represent?

A. The error rate of the classifier.
B. The probability that the learning algorithm fails to output a good hypothesis.
C. The learning rate of the gradient descent.
D. The dimensionality of the input space.

10 A learning algorithm is considered Consistent if:

A. It always produces the same hypothesis for different datasets.
B. It produces a hypothesis that makes zero errors on the training examples.
C. It has zero variance.
D. It does not require inductive bias.

11 Inductive Bias is necessary in machine learning because:

A. It speeds up the hardware processing.
B. Without it, a learner cannot generalize beyond the observed training examples.
C. It eliminates the need for training data.
D. It ensures the target function is always linear.

12 Which of the following is an example of a Restriction Bias (Language Bias)?

A. Preferring the shortest decision tree.
B. Limiting the hypothesis space to linear separators.
C. Using Gradient Descent to find weights.
D. Preferring hypotheses with larger margins.

13 Occam's Razor serves as a basis for which type of inductive bias?

A. Restriction Bias
B. Preference Bias
C. Sampling Bias
D. Confirmation Bias

14 The No Free Lunch Theorem essentially states that:

A. Deep learning is always superior to other methods.
B. If averaged over all possible data generating distributions, every classification algorithm has the same error rate.
C. More data always leads to better performance regardless of the algorithm.
D. Computational cost is the only constraint in learning.

15 What is Sample Complexity?

A. The time complexity required to process a sample.
B. The number of training examples required to learn a target function to within error ε with probability at least 1 − δ.
C. The complexity of the mathematical function used to generate samples.
D. The number of features in the dataset.

16 For a finite hypothesis space H, the sample complexity bound for a consistent learner is roughly proportional to:

A. (1/ε)(ln|H| + ln(1/δ))
B. |H|/ε
C. ε ln|H|
D. (1/δ) ln ε
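The standard finite-hypothesis bound for a consistent learner, m ≥ (1/ε)(ln|H| + ln(1/δ)), is easy to evaluate numerically. The numbers below (|H| = 1000, ε = 0.1, δ = 0.05) are made up for illustration; the helper name is not from any library.

```python
# Sample-complexity bound for a consistent learner over a finite hypothesis
# space: m >= (1/eps) * (ln|H| + ln(1/delta)).
import math

def sample_complexity(H_size, eps, delta):
    # Smallest integer m satisfying the bound.
    return math.ceil((math.log(H_size) + math.log(1.0 / delta)) / eps)

m = sample_complexity(1000, 0.1, 0.05)
print(m)  # 100
```

The bound also makes the scaling in questions 40 and 41 concrete: it grows only logarithmically as δ shrinks, but linearly in 1/ε, so tightening the error parameter is far more expensive than tightening the confidence parameter.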

17 If a learning problem is Agnostic, it means:

A. The target function is guaranteed to be in the hypothesis space H.
B. We do not assume the target function is contained within the hypothesis space H.
C. The learner ignores the training data.
D. The labels are missing from the training set.

18 The VC Dimension (Vapnik-Chervonenkis) measures:

A. The computational speed of an algorithm.
B. The number of parameters in a neural network.
C. The capacity or complexity of a hypothesis space.
D. The size of the dataset.

19 A set of points S is shattered by a hypothesis space H if:

A. The points are linearly separable.
B. For every possible labeling of the points in S, there exists a hypothesis in H that classifies them correctly.
C. The points cannot be classified correctly by any h in H.
D. The points are drawn from a uniform distribution.

20 What is the VC dimension of a linear classifier (perceptron) in 2-dimensional space (R^2)?

A. 2
B. 3
C. 4
D. Infinite
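The claim behind question 20 (a 2-D perceptron can shatter 3 non-collinear points) can be checked by brute force. The three points and the weight grid below are my own choices for the sketch; a coarse grid suffices because any labeling of these points is separable with simple weights.

```python
# Verify that a linear classifier in 2-D shatters 3 non-collinear points,
# consistent with the 2-D perceptron having VC dimension 3.
from itertools import product

points = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]

def separates(w1, w2, b, labels):
    # True if sign(w1*x + w2*y + b) reproduces the given labeling exactly.
    return all((w1 * x + w2 * y + b > 0) == bool(lab)
               for (x, y), lab in zip(points, labels))

grid = [-2.0, -1.0, -0.5, 0.5, 1.0, 2.0]  # coarse search over (w1, w2, b)
shattered = all(
    any(separates(w1, w2, b, labels) for w1, w2, b in product(grid, repeat=3))
    for labels in product([0, 1], repeat=3)
)
print(shattered)  # True: all 8 labelings of the 3 points are realizable
```

Shattering 4 points in 2-D fails for every configuration (e.g., the XOR labeling of 4 points in general position), which is why the answer is 3, not 4.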

21 Which of the following implies that a hypothesis space has infinite VC dimension?

A. It can shatter a dataset of size 3.
B. It can shatter datasets of arbitrarily large size.
C. It contains only linear functions.
D. It uses a Euclidean distance metric.

22 In the context of Model Selection, Structural Risk Minimization (SRM) aims to:

A. Minimize empirical risk only.
B. Balance empirical risk and the complexity of the hypothesis space.
C. Maximize the complexity of the hypothesis space.
D. Minimize the training time.

23 Which loss function is commonly used for regression problems in the statistical learning framework?

A. Zero-One Loss
B. Squared Error Loss
C. Hinge Loss
D. Cross-Entropy Loss

24 What is the consequence of having a hypothesis space with a VC dimension significantly higher than the number of training examples?

A. Underfitting
B. Overfitting
C. Convergence to the optimal solution.
D. Zero computational cost.

25 The inductive bias of the k-Nearest Neighbor (k-NN) algorithm is:

A. The decision boundary is linear.
B. The target function is a decision tree.
C. Points close to each other in feature space likely have the same label.
D. The features are statistically independent.
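The locality bias in question 25 is visible directly in the algorithm: a minimal 1-NN sketch (toy data and function names invented for the example) predicts the label of whichever training point is closest, which only works if nearby points tend to share labels.

```python
# Minimal 1-nearest-neighbor sketch. Inductive bias: points close together
# in feature space are assumed to have the same label.

def predict_1nn(train, query):
    # train: list of ((x, y), label) pairs; query: an (x, y) point.
    def dist2(p, q):
        return (p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2
    # Return the label of the nearest training point.
    return min(train, key=lambda item: dist2(item[0], query))[1]

train = [((0.0, 0.0), 'a'), ((1.0, 1.0), 'b')]
print(predict_1nn(train, (0.2, 0.1)))  # 'a' — the closer neighbor's label
```

Note there is no restriction bias here (no decision boundary shape is ruled out); the bias is purely a preference for locally constant labelings.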

26 Which of the following statements about Prior Knowledge is TRUE?

A. It is useless in the era of Big Data.
B. It can reduce the sample complexity of a learning task.
C. It increases the likelihood of overfitting.
D. It is strictly forbidden in unsupervised learning.

27 In the inequality |R(h) − R̂(h)| ≤ ε, relating the true risk R(h) and the empirical risk R̂(h), what is this bound attempting to quantify?

A. The accuracy of the training data labels.
B. The generalization gap.
C. The computational speed.
D. The precision of the floating-point calculations.

28 What is a Hypothesis in machine learning?

A. A proven theorem.
B. A specific function mapping inputs to outputs selected from the hypothesis space.
C. The raw data collected.
D. The error metric used.

29 Which learning scenario involves a 'supervisor' providing correct labels?

A. Unsupervised Learning
B. Supervised Learning
C. Reinforcement Learning
D. Clustering

30 The Hoeffding Inequality provides a bound for:

A. The difference between the true mean and the empirical mean of a random variable.
B. The maximum depth of a decision tree.
C. The optimal number of clusters.
D. The convergence rate of gradient descent.
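For a random variable bounded in [0, 1] (such as a zero-one loss), Hoeffding's inequality states P(|empirical mean − true mean| > ε) ≤ 2·exp(−2mε²). A quick numeric check (sample sizes chosen arbitrarily for illustration) shows how fast the bound tightens with m:

```python
# Hoeffding bound for the deviation of an empirical mean of m samples of a
# [0, 1]-bounded random variable from its true mean.
import math

def hoeffding_bound(m, eps):
    # Upper bound on P(|empirical mean - true mean| > eps).
    return 2.0 * math.exp(-2.0 * m * eps * eps)

print(hoeffding_bound(100, 0.1))   # ~0.271: weak guarantee at m = 100
print(hoeffding_bound(1000, 0.1))  # ~4.1e-09: essentially certain at m = 1000
```

Applied to each hypothesis in a finite class and combined with the union bound, this is exactly the route to the sample-complexity results in the earlier questions.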

31 If a hypothesis space H is finite, is it PAC-learnable?

A. No, never.
B. Yes, provided the target concept is in H and we have enough samples.
C. Only if the size of H is less than 10.
D. Only if the VC dimension is infinite.

32 The Bias-Variance Tradeoff implies that:

A. We should always minimize bias to zero.
B. Increasing model complexity decreases bias but increases variance.
C. Increasing model complexity increases bias but decreases variance.
D. Bias and variance are independent of model complexity.

33 Why is Zero-One Loss difficult to optimize directly?

A. It is always zero.
B. It is non-convex and not differentiable.
C. It requires infinite data.
D. It produces negative values.

34 Which of the following represents the Approximation Error?

A. The error due to finite training samples (variance).
B. The minimum possible risk achievable by a hypothesis in H versus the true target function.
C. The error due to noise in labels.
D. The calculation error of the CPU.

35 The Fundamental Theorem of Statistical Learning relates PAC learnability to:

A. Neural Network depth.
B. Finite VC Dimension.
C. Gaussian distributions.
D. Unsupervised clustering.

36 If an algorithm chooses a hypothesis simply because it works well on training data, but has no theoretical justification for working on unseen data, it lacks:

A. Consistency
B. Generalization guarantees
C. Empirical accuracy
D. Optimization speed

37 In the context of the No Free Lunch Theorem, when is Algorithm A better than Algorithm B?

A. Always, if A is a Deep Neural Network.
B. Only with respect to a specific distribution or class of problems.
C. If A has more parameters than B.
D. If A runs faster than B.

38 What is the Estimation Error?

A. The error caused by selecting a specific hypothesis from H using finite data instead of the best possible hypothesis in H.
B. The error caused by the hypothesis space not containing the target.
C. The inherent noise in the system.
D. The error in measuring features.

39 Which inequality is primarily used to derive the sample complexity bound m ≥ (1/ε)(ln|H| + ln(1/δ))?

A. Cauchy-Schwarz Inequality
B. Union Bound
C. Triangle Inequality
D. Jensen's Inequality

40 If we increase the confidence parameter δ (e.g., from 0.01 to 0.1), the required sample size:

A. Increases
B. Decreases
C. Stays the same
D. Becomes infinite

41 If we decrease the error parameter ε (e.g., from 0.1 to 0.01), the required sample size:

A. Increases
B. Decreases
C. Stays the same
D. Becomes zero

42 A hypothesis space consisting of all possible axis-aligned rectangles in 2D has a VC dimension of:

A. 2
B. 3
C. 4
D. 5
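The VC dimension of 4 in question 42 can be verified computationally for one witness configuration. The four points below (a "diamond") are my own choice; for each of the 16 labelings, it suffices to check that the bounding box of the positive points excludes every negative point.

```python
# Verify that axis-aligned rectangles shatter 4 points in a diamond
# configuration, consistent with VC dimension 4.
from itertools import product

points = [(0, 1), (0, -1), (1, 0), (-1, 0)]

def realizable(labels):
    # Is there an axis-aligned rectangle containing exactly the positives?
    pos = [p for p, lab in zip(points, labels) if lab]
    if not pos:
        return True  # a degenerate rectangle away from all points works
    lo_x, hi_x = min(p[0] for p in pos), max(p[0] for p in pos)
    lo_y, hi_y = min(p[1] for p in pos), max(p[1] for p in pos)
    # The tightest rectangle around the positives must exclude every negative.
    return all(not (lo_x <= p[0] <= hi_x and lo_y <= p[1] <= hi_y)
               for p, lab in zip(points, labels) if not lab)

shattered = all(realizable(labels) for labels in product([0, 1], repeat=4))
print(shattered)  # True: all 16 labelings are realizable
```

No set of 5 points can be shattered by axis-aligned rectangles (the bounding box of any 4 of them traps the fifth for some labeling), so the VC dimension is exactly 4.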

43 Which of the following is an assumption of the PAC framework?

A. The training and testing data are drawn from the same distribution.
B. The distribution changes over time.
C. The learner knows the distribution beforehand.
D. The noise level is exactly zero.

44 What role does Prior Knowledge play in the choice of a hypothesis space?

A. It allows the use of a larger, more complex space.
B. It suggests choosing a space that is likely to contain the target function but is not unnecessarily complex.
C. It ensures the VC dimension is infinite.
D. It removes the need for a loss function.

45 Validation Sets are used to:

A. Train the model parameters.
B. Estimate the generalization error and tune hyperparameters.
C. Increase the training set size.
D. Calculate the exact VC dimension.

46 In the context of the 'Curse of Dimensionality', as the number of features increases, the amount of data needed to generalize accurately:

A. Increases linearly.
B. Increases exponentially.
C. Decreases.
D. Remains constant.

47 Which strategy helps when the Sample Complexity is too high for the available data?

A. Increase the complexity of the model.
B. Use a stronger inductive bias (simpler model).
C. Decrease ε and δ.
D. Discard training data.

48 The Bayes Optimal Classifier represents:

A. The worst possible classifier.
B. The classifier with the minimum possible theoretical error rate.
C. A classifier that assumes all features are dependent.
D. A linear classifier.

49 An algorithm is Efficiently PAC-Learnable if:

A. It runs in polynomial time with respect to 1/ε, 1/δ, and size(c).
B. It runs in exponential time.
C. It requires zero samples.
D. It guarantees 100% accuracy.

50 When choosing an algorithm based on data assumptions, if you assume the data is linearly separable with a large margin, which algorithm is theoretically most appropriate?

A. 1-Nearest Neighbor
B. Support Vector Machine (SVM)
C. Decision Stump
D. Naive Bayes