1. According to Tom Mitchell's definition, a computer program is said to learn from experience E with respect to some class of tasks T and performance measure P if:
A. Its performance at tasks in T, as measured by P, improves with experience E.
B. It can memorize the experience perfectly without error.
C. Its performance at tasks in T remains constant regardless of E.
D. It requires no prior knowledge to solve tasks in T.
Correct Answer: Its performance at tasks in T, as measured by P, improves with experience E.
Explanation: This is the formal definition of a well-posed learning problem provided by Tom Mitchell. Learning is characterized by performance improvement on specific tasks based on experience.
2. In the context of a 'Checkers Learning Problem', what represents the Task (T)?
A. The percent of games won against opponents.
B. Playing checkers games.
C. Playing practice games against itself.
D. The rules of the game.
Correct Answer: Playing checkers games.
Explanation: T is the task the system is performing (playing checkers), P is the performance measure (percent of games won), and E is the experience (practice games).
3. Which component of a learning system represents the set of all possible functions that the learning algorithm can select as the learned function?
A. The training set
B. The target function
C. The hypothesis space
D. The feature vector
Correct Answer: The hypothesis space
Explanation: The hypothesis space (H) is the set of all legal hypotheses (functions) that the algorithm can explore and select from to approximate the target function.
4. In the Statistical Learning Framework, the data generation process assumes that data pairs (x, y) are generated independently and identically distributed (i.i.d.) according to:
A. A known Gaussian distribution.
B. A fixed but unknown probability distribution D.
C. A uniform distribution over integers.
D. The user's manual input.
Correct Answer: A fixed but unknown probability distribution D.
Explanation: The standard statistical learning framework assumes there is an underlying, fixed, but unknown joint probability distribution D (or P(x, y)) from which data is sampled.
5. What is the primary goal of the Empirical Risk Minimization (ERM) principle?
A. To minimize the error on the unseen test data directly.
B. To minimize the average loss on the observed training data.
C. To maximize the size of the hypothesis space.
D. To minimize the computational time of the algorithm.
Correct Answer: To minimize the average loss on the observed training data.
Explanation: ERM seeks to find a hypothesis that minimizes the empirical risk (average loss) calculated over the given training sample S, as a proxy for the true risk.
6. The True Risk (or Generalization Error) is defined as:
A. The average error on the training set.
B. The expectation of the loss function over the true distribution D.
C. The difference between training error and validation error.
D. The square root of the bias.
Correct Answer: The expectation of the loss function over the true distribution D.
Explanation: Mathematically, R(h) = E_{(x,y)~D}[L(h(x), y)]. It represents the expected error on future unseen data drawn from the same distribution.
7. Given a loss function L, the Empirical Risk R_emp(h) for a dataset S = {(x_1, y_1), ..., (x_n, y_n)} is given by:
A. sum_{i=1}^{n} L(h(x_i), y_i)
B. (1/n) sum_{i=1}^{n} L(h(x_i), y_i)
C. max_i L(h(x_i), y_i)
D. E_{(x,y)~D}[L(h(x), y)]
Correct Answer: (1/n) sum_{i=1}^{n} L(h(x_i), y_i)
Explanation: Empirical risk is the arithmetic mean of the loss calculated over the specific finite training set of size n.
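The two preceding questions can be made concrete with a short sketch (not from the source; the helper names `empirical_risk` and `erm` are illustrative): compute the average loss of each candidate hypothesis on the training set and let ERM pick the minimizer.

```python
def squared_loss(y_pred, y_true):
    return (y_pred - y_true) ** 2

def empirical_risk(h, data, loss=squared_loss):
    # Arithmetic mean of the loss over the finite training set of size n.
    return sum(loss(h(x), y) for x, y in data) / len(data)

def erm(hypotheses, data):
    # Empirical Risk Minimization: choose the hypothesis with lowest R_emp.
    return min(hypotheses, key=lambda h: empirical_risk(h, data))

data = [(0.0, 0.1), (1.0, 2.1), (2.0, 3.9)]              # roughly y = 2x
candidates = [lambda x: x, lambda x: 2 * x, lambda x: 3 * x]
best = erm(candidates, data)
print(empirical_risk(best, data))   # the 2x hypothesis has the lowest risk
```

The empirical risk of the chosen hypothesis is only a proxy for the true risk; the later questions on generalization bounds quantify how far apart the two can be.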
8. In PAC Learning, what does the parameter ε (epsilon) represent?
A. The probability that the hypothesis is incorrect.
B. The maximum allowed error (accuracy parameter).
C. The size of the training dataset.
D. The complexity of the hypothesis space.
Correct Answer: The maximum allowed error (accuracy parameter).
Explanation: In Probably Approximately Correct (PAC) learning, we want the error of the hypothesis to be at most ε. It defines the 'Approximately' part of PAC.
9. In PAC Learning, what does the parameter δ (delta) represent?
A. The error rate of the classifier.
B. The probability that the learning algorithm fails to output a good hypothesis.
C. The learning rate of the gradient descent.
D. The dimensionality of the input space.
Correct Answer: The probability that the learning algorithm fails to output a good hypothesis.
Explanation: δ is the confidence parameter. We want the probability of producing a bad hypothesis (error > ε) to be at most δ. This is the 'Probably' part of PAC.
10. A learning algorithm is considered Consistent if:
A. It always produces the same hypothesis for different datasets.
B. It produces a hypothesis that makes zero errors on the training examples.
C. It has zero variance.
D. It does not require inductive bias.
Correct Answer: It produces a hypothesis that makes zero errors on the training examples.
Explanation: Consistency in this context means the hypothesis fits the training data perfectly (assuming the target function is in the hypothesis space).
11. Inductive Bias is necessary in machine learning because:
A. It speeds up the hardware processing.
B. Without it, a learner cannot generalize beyond the observed training examples.
C. It eliminates the need for training data.
D. It ensures the target function is always linear.
Correct Answer: Without it, a learner cannot generalize beyond the observed training examples.
Explanation: Without inductive bias, a learner can only memorize the training data. To predict unseen data, the learner must make assumptions (bias) about the structure of the target function.
12. Which of the following is an example of a Restriction Bias (Language Bias)?
A. Preferring the shortest decision tree.
B. Limiting the hypothesis space to linear separators.
C. Using Gradient Descent to find weights.
D. Preferring hypotheses with larger margins.
Correct Answer: Limiting the hypothesis space to linear separators.
Explanation: Restriction bias strictly limits the set of hypotheses considered (e.g., only linear models). Preference bias defines a preference ordering within the space (e.g., simpler trees).
13. Occam's Razor serves as a basis for which type of inductive bias?
A. Restriction Bias
B. Preference Bias
C. Sampling Bias
D. Confirmation Bias
Correct Answer: Preference Bias
Explanation: Occam's Razor suggests preferring the simplest hypothesis that fits the data. This is a preference (or search) bias, not a hard restriction on what is possible.
14. The No Free Lunch Theorem essentially states that:
A. Deep learning is always superior to other methods.
B. If averaged over all possible data generating distributions, every classification algorithm has the same error rate.
C. More data always leads to better performance regardless of the algorithm.
D. Computational cost is the only constraint in learning.
Correct Answer: If averaged over all possible data generating distributions, every classification algorithm has the same error rate.
Explanation: The NFL theorem asserts that no single learning algorithm is universally superior. Superiority is only possible given specific assumptions about the problem domain.
15. What is Sample Complexity?
A. The time complexity required to process a sample.
B. The number of training examples required to learn a target function to within error ε with probability 1 - δ.
C. The complexity of the mathematical function used to generate samples.
D. The number of features in the dataset.
Correct Answer: The number of training examples required to learn a target function to within error ε with probability 1 - δ.
Explanation: Sample complexity quantifies the amount of data needed to achieve a specific level of generalization performance in the PAC framework.
16. For a finite hypothesis space H, the sample complexity bound for a consistent learner is roughly proportional to:
A. |H|
B. ln |H|
C. |H|^2
D. 2^|H|
Correct Answer: ln |H|
Explanation: The bound is generally m >= (1/ε)(ln |H| + ln(1/δ)), making it logarithmic with respect to the size of the hypothesis space.
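As an illustrative sketch (the helper name `sample_complexity` and the numbers are my own), the bound in the explanation can be evaluated directly, which also shows how m moves when ε or δ changes:

```python
import math

def sample_complexity(H_size, eps, delta):
    # m >= (1/eps) * (ln|H| + ln(1/delta)) for a consistent learner
    # over a finite hypothesis space (realizable PAC setting).
    return math.ceil((math.log(H_size) + math.log(1 / delta)) / eps)

m = sample_complexity(H_size=10**6, eps=0.1, delta=0.05)
print(m)
# Tightening the accuracy requirement (smaller eps) raises m sharply;
# relaxing the confidence requirement (larger delta) lowers it.
print(sample_complexity(10**6, 0.01, 0.05))   # smaller eps  -> more samples
print(sample_complexity(10**6, 0.1, 0.5))     # larger delta -> fewer samples
```

Note the logarithmic dependence on |H|: squaring the size of the hypothesis space only doubles its contribution to m.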
17. If a learning problem is Agnostic, it means:
A. The target function is guaranteed to be in the hypothesis space H.
B. We do not assume the target function is contained within the hypothesis space H.
C. The learner ignores the training data.
D. The labels are missing from the training set.
Correct Answer: We do not assume the target function is contained within the hypothesis space H.
Explanation: Agnostic learning (or the unrealizable setting) assumes the true target concept might not be representable by our model class, so we look for the best approximation.
18. What does the VC (Vapnik-Chervonenkis) Dimension measure?
A. The number of parameters in a model.
B. The runtime of the learning algorithm.
C. The capacity or complexity of a hypothesis space.
D. The size of the dataset.
Correct Answer: The capacity or complexity of a hypothesis space.
Explanation: VC Dimension is a measure of the capacity of a statistical classification algorithm, defined as the cardinality of the largest set of points that the algorithm can shatter.
19. A set of points S is shattered by a hypothesis space H if:
A. The points are linearly separable.
B. For every possible labeling of the points in S, there exists a hypothesis in H that classifies them correctly.
C. The points cannot be classified correctly by any h in H.
D. The points are drawn from a uniform distribution.
Correct Answer: For every possible labeling of the points in S, there exists a hypothesis in H that classifies them correctly.
Explanation: Shattering means the hypothesis space is expressive enough to separate the points regardless of how they are labeled (binary classification).
20. What is the VC dimension of a linear classifier (perceptron) in 2-dimensional space (R^2)?
A. 2
B. 3
C. 4
D. Infinite
Correct Answer: 3
Explanation: In d dimensions, the VC dimension of a linear separator is d + 1. For d = 2, it is 3.
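A brute-force sketch (my own construction, not from the source) can illustrate the d + 1 result: search a small grid of linear classifiers for separators of every labeling of 3 points, and of the XOR labeling of 4 points. Finding a separator proves shatterability; failing to find one on a finite grid is only suggestive, though for the XOR labeling non-separability is easy to prove analytically.

```python
from itertools import product

def separable(points, labels, grid):
    # True if some linear rule sign(w1*x1 + w2*x2 + b) matches all labels.
    for w1, w2, b in product(grid, grid, grid):
        if all((w1 * x + w2 * y + b > 0) == (lab == 1)
               for (x, y), lab in zip(points, labels)):
            return True
    return False

grid = [v / 2 for v in range(-6, 7)]           # weights -3.0, -2.5, ..., 3.0
tri = [(0, 0), (1, 0), (0, 1)]                  # 3 points: all 8 labelings work
shatter3 = all(separable(tri, labs, grid)
               for labs in product([1, -1], repeat=3))
quad = [(0, 0), (1, 1), (1, 0), (0, 1)]         # XOR labeling of 4 points
xor_ok = separable(quad, [1, 1, -1, -1], grid)
print(shatter3, xor_ok)   # True False, consistent with VC dim d + 1 = 3 in 2D
```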
21. Which of the following implies that a hypothesis space has infinite VC dimension?
A. It can shatter a dataset of size 3.
B. It can shatter datasets of arbitrarily large size.
C. It contains only linear functions.
D. It uses a Euclidean distance metric.
Correct Answer: It can shatter datasets of arbitrarily large size.
Explanation: If for any integer n, there exists a set of size n that can be shattered, the VC dimension is infinite (e.g., 1-Nearest Neighbor or sine waves).
22. In the context of Model Selection, Structural Risk Minimization (SRM) aims to:
A. Minimize empirical risk only.
B. Balance empirical risk and the complexity of the hypothesis space.
C. Maximize the complexity of the hypothesis space.
D. Minimize the training time.
Correct Answer: Balance empirical risk and the complexity of the hypothesis space.
Explanation: SRM adds a penalty term for model complexity to the empirical risk, effectively trading off between fitting the data well and keeping the model simple (regularization).
23. Which loss function is commonly used for regression problems in the statistical learning framework?
A. Zero-One Loss
B. Squared Error Loss
C. Hinge Loss
D. Cross-Entropy Loss
Correct Answer: Squared Error Loss
Explanation: Squared error loss, L(h(x), y) = (h(x) - y)^2, is the standard loss function for continuous regression problems.
24. What is the consequence of having a hypothesis space with a VC dimension significantly higher than the number of training examples?
A. Underfitting
B. Overfitting
C. Convergence to the optimal solution.
D. Zero computational cost.
Correct Answer: Overfitting
Explanation: If the model capacity (VC dim) is much larger than the number of samples, the model can memorize noise, leading to low training error but high generalization error (overfitting).
25. The inductive bias of the k-Nearest Neighbor (k-NN) algorithm is:
A. The decision boundary is linear.
B. The target function is a decision tree.
C. Points close to each other in feature space likely have the same label.
D. The features are statistically independent.
Correct Answer: Points close to each other in feature space likely have the same label.
Explanation: k-NN relies on the assumption of smoothness or locality: similar inputs yield similar outputs.
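A minimal 1-NN sketch (the function `knn_predict` is illustrative, not from the source) makes the locality bias explicit: the prediction at a query point is copied from its nearest training points.

```python
import math

def knn_predict(train, query, k=1):
    # k-NN encodes the locality bias: the label of a query point is taken
    # from its nearest neighbours in feature space.
    neighbours = sorted(train, key=lambda p: math.dist(p[0], query))[:k]
    labels = [lab for _, lab in neighbours]
    return max(set(labels), key=labels.count)

train = [((0.0, 0.0), "red"), ((0.2, 0.1), "red"), ((5.0, 5.0), "blue")]
print(knn_predict(train, (0.1, 0.1)))   # "red": nearby points share a label
print(knn_predict(train, (4.9, 5.1)))   # "blue"
```

If the locality assumption is violated (nearby points often have different labels), this bias hurts rather than helps, which is exactly the point of the earlier No Free Lunch question.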
26. Which of the following statements about Prior Knowledge is TRUE?
A. It is useless in the era of Big Data.
B. It can reduce the sample complexity of a learning task.
C. It increases the likelihood of overfitting.
D. It is strictly forbidden in unsupervised learning.
Correct Answer: It can reduce the sample complexity of a learning task.
Explanation: Incorporating prior knowledge (e.g., via constraints or specific hypothesis spaces) reduces the effective search space, allowing the model to learn from fewer examples.
27. In the inequality |R(h) - R_emp(h)| <= ε, what is this bound attempting to quantify?
A. The accuracy of the training data labels.
B. The generalization gap.
C. The computational speed.
D. The precision of the floating-point calculations.
Correct Answer: The generalization gap.
Explanation: This inequality bounds the difference between the observed training error and the true error, ensuring that the empirical performance is a reliable indicator of true performance.
28. What is a Hypothesis in machine learning?
A. A proven theorem.
B. A specific function mapping inputs to outputs selected from the hypothesis space.
C. The raw data collected.
D. The error metric used.
Correct Answer: A specific function mapping inputs to outputs selected from the hypothesis space.
Explanation: A hypothesis is a candidate model or function that attempts to approximate the target function.
29. Which learning scenario involves a 'supervisor' providing correct labels?
A. Unsupervised Learning
B. Supervised Learning
C. Reinforcement Learning
D. Clustering
Correct Answer: Supervised Learning
Explanation: Supervised learning is defined by the presence of labeled training data (input-output pairs).
30. The Hoeffding Inequality provides a bound for:
A. The difference between the true mean and the empirical mean of a random variable.
B. The maximum depth of a decision tree.
C. The optimal number of clusters.
D. The convergence rate of gradient descent.
Correct Answer: The difference between the true mean and the empirical mean of a random variable.
Explanation: Hoeffding's inequality bounds the probability that the sum (or average) of random variables deviates from its expected value. It is crucial for proving PAC bounds.
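A quick simulation (my own sketch, with arbitrary parameter choices) compares the observed deviation frequency of a sample mean against the two-sided Hoeffding bound 2 exp(-2 n ε^2) for variables bounded in [0, 1]:

```python
import math
import random

random.seed(0)
n, eps, p, trials = 100, 0.1, 0.5, 2000
# Hoeffding (two-sided, [0,1]-bounded variables): P(|mean - p| >= eps) <= bound
bound = 2 * math.exp(-2 * n * eps**2)

deviations = 0
for _ in range(trials):
    mean = sum(random.random() < p for _ in range(n)) / n   # Bernoulli(p) sample mean
    if abs(mean - p) >= eps:
        deviations += 1
freq = deviations / trials
print(freq, bound)   # the observed frequency stays well below the bound (~0.27)
```

The bound is distribution-free, so it is loose for any particular distribution; its value is that it holds uniformly, which is what the PAC derivations need.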
31. If a hypothesis space is finite, is it PAC-learnable?
A. No, never.
B. Yes, provided the target concept is in H and we have enough samples.
C. Only if the size of H is less than 10.
D. Only if the VC dimension is infinite.
Correct Answer: Yes, provided the target concept is in H and we have enough samples.
Explanation: Finite hypothesis spaces are PAC-learnable. The sample complexity is polynomial in 1/ε, 1/δ, and ln |H|.
32. The Bias-Variance Tradeoff implies that:
A. We should always minimize bias to zero.
B. Increasing model complexity decreases bias but increases variance.
C. Increasing model complexity increases bias but decreases variance.
D. Bias and variance are independent of model complexity.
Correct Answer: Increasing model complexity decreases bias but increases variance.
Explanation: Complex models fit training data well (low bias) but fluctuate wildly with different datasets (high variance). Simple models are stable (low variance) but may miss patterns (high bias).
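The tradeoff can be observed with a small resampling experiment (my own sketch; the constant-mean and 1-NN predictors stand in for "simple" and "complex" models): over many noisy datasets, the rigid model shows higher bias and lower variance than the flexible one.

```python
import random
import statistics

random.seed(1)
xs = [0.0, 0.5, 1.0, 1.5, 2.0]
f = lambda x: x * x                       # true target function
x0, trials = 1.0, 4000                    # evaluate predictions at x0

simple_preds, complex_preds = [], []
for _ in range(trials):
    ys = [f(x) + random.gauss(0, 0.2) for x in xs]      # fresh noisy dataset
    simple_preds.append(sum(ys) / len(ys))              # constant predictor: rigid, stable
    nearest = min(range(len(xs)), key=lambda i: abs(xs[i] - x0))
    complex_preds.append(ys[nearest])                   # 1-NN predictor: flexible, noisy

bias_simple = abs(statistics.mean(simple_preds) - f(x0))
bias_complex = abs(statistics.mean(complex_preds) - f(x0))
var_simple = statistics.variance(simple_preds)
var_complex = statistics.variance(complex_preds)
# Typically: bias_simple > bias_complex while var_complex > var_simple.
print(bias_simple, bias_complex, var_simple, var_complex)
```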
33. Why is Zero-One Loss difficult to optimize directly?
A. It is always zero.
B. It is non-convex and not differentiable.
C. It requires infinite data.
D. It produces negative values.
Correct Answer: It is non-convex and not differentiable.
Explanation: 0/1 loss is a step function (constant regions with jumps), making gradients zero or undefined, preventing gradient-based optimization.
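A two-line sketch (illustrative, not from the source) shows why: the finite-difference derivative of the 0/1 loss with respect to a weight is zero almost everywhere, so gradient methods receive no signal about which direction improves the classifier.

```python
def zero_one_loss(w, x, y):
    # Predict sign(w*x); loss is 1 on a mistake, 0 otherwise.
    pred = 1 if w * x > 0 else -1
    return 0.0 if pred == y else 1.0

# The 0/1 loss is piecewise constant in w, so away from the decision
# boundary the numerical derivative is exactly zero.
w, x, y, h = 0.5, 1.0, -1, 1e-6
grad = (zero_one_loss(w + h, x, y) - zero_one_loss(w - h, x, y)) / (2 * h)
print(grad)   # 0.0
```

This is why practical learners minimize smooth surrogates (squared, hinge, or cross-entropy loss) instead of the 0/1 loss itself.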
34. Which of the following represents the Approximation Error?
A. The error due to finite training samples (variance).
B. The minimum possible risk achievable by a hypothesis in H versus the true target function.
C. The error due to noise in labels.
D. The calculation error of the CPU.
Correct Answer: The minimum possible risk achievable by a hypothesis in H versus the true target function.
Explanation: Approximation error is the error incurred because the hypothesis space might not contain the true target function. It is a measure of inductive bias limitations.
35. The Fundamental Theorem of Statistical Learning relates PAC learnability to:
A. Neural Network depth.
B. Finite VC Dimension.
C. Gaussian distributions.
D. Unsupervised clustering.
Correct Answer: Finite VC Dimension.
Explanation: The theorem states that a hypothesis class is PAC-learnable if and only if its VC dimension is finite.
36. If an algorithm chooses a hypothesis simply because it works well on training data, but has no theoretical justification for working on unseen data, it lacks:
A. Consistency
B. Generalization guarantees
C. Empirical accuracy
D. Optimization speed
Correct Answer: Generalization guarantees
Explanation: Without theoretical bounds (like PAC or VC), good training performance does not mathematically guarantee good test performance.
37. In the context of the No Free Lunch Theorem, when is Algorithm A better than Algorithm B?
A. Always, if A is a Deep Neural Network.
B. Only with respect to a specific distribution or class of problems.
C. If A has more parameters than B.
D. If A runs faster than B.
Correct Answer: Only with respect to a specific distribution or class of problems.
Explanation: NFL states that performance is tied to specific problem domains. One algorithm outperforms another only if its inductive bias matches the specific problem.
38. What is the Estimation Error?
A. The error caused by selecting a specific hypothesis from H using finite data, instead of the best possible hypothesis in H.
B. The error caused by the hypothesis space not containing the target.
C. The inherent noise in the system.
D. The error in measuring features.
Correct Answer: The error caused by selecting a specific hypothesis from H using finite data, instead of the best possible hypothesis in H.
Explanation: Estimation error arises because we have limited training data, so we might pick a sub-optimal hypothesis from H (variance).
39. Which inequality is primarily used to derive the sample complexity bound m >= (1/ε)(ln |H| + ln(1/δ))?
A. Cauchy-Schwarz Inequality
B. Union Bound
C. Triangle Inequality
D. Jensen's Inequality
Correct Answer: Union Bound
Explanation: The derivation typically sums the probabilities of bad hypotheses using the Union Bound, combined with an exponential bound like Hoeffding's.
40. If we increase the confidence parameter δ (e.g., from 0.01 to 0.1), the required sample size:
A. Increases
B. Decreases
C. Stays the same
D. Becomes infinite
Correct Answer: Decreases
Explanation: Higher δ means we are accepting a higher probability of failure (lower confidence). Therefore, we need fewer samples.
41. If we decrease the error parameter ε (e.g., from 0.1 to 0.01), the required sample size:
A. Increases
B. Decreases
C. Stays the same
D. Becomes zero
Correct Answer: Increases
Explanation: Lower ε means we demand higher accuracy. This requires significantly more training data.
42. A hypothesis space consisting of all possible axis-aligned rectangles in 2D has a VC dimension of:
A. 2
B. 3
C. 4
D. 5
Correct Answer: 4
Explanation: You can shatter 4 points (in a diamond shape) with axis-aligned rectangles, but not 5.
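The shattering claim can be checked by brute force (my own sketch, not from the source): enumerate axis-aligned rectangles with bounds drawn from a small set of cut points and verify that every one of the 16 labelings of a diamond-shaped set of 4 points is realized.

```python
from itertools import product

def rect_consistent(points, labels, cuts):
    # True if some axis-aligned rectangle contains exactly the points
    # labelled +1. Bounds drawn from a finite set of cuts, which suffices
    # for these integer-coordinate points.
    for xlo, xhi in product(cuts, cuts):
        for ylo, yhi in product(cuts, cuts):
            if xlo >= xhi or ylo >= yhi:
                continue
            if all((xlo <= x <= xhi and ylo <= y <= yhi) == (lab == 1)
                   for (x, y), lab in zip(points, labels)):
                return True
    return False

diamond = [(0, 1), (1, 0), (1, 2), (2, 1)]     # 4 points in a diamond shape
cuts = [-0.5, 0.5, 1.5, 2.5]
shattered = all(rect_consistent(diamond, labs, cuts)
                for labs in product([1, -1], repeat=4))
print(shattered)   # True: axis-aligned rectangles shatter these 4 points
```

For any 5 points, the rectangle spanning the extreme points in each direction must contain the fifth, so the "leave out the inner point" labeling is unrealizable, which is why the VC dimension stops at 4.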
43. Which of the following is an assumption of the PAC framework?
A. The training and testing data are drawn from the same distribution.
B. The distribution changes over time.
C. The learner knows the distribution beforehand.
D. The noise level is exactly zero.
Correct Answer: The training and testing data are drawn from the same distribution.
Explanation: Stationarity (same fixed distribution for train and test) is a core assumption for standard PAC bounds to hold.
44. What role does Prior Knowledge play in the choice of a hypothesis space?
A. It allows the use of a larger, more complex space.
B. It suggests choosing a space that is likely to contain the target function but is not unnecessarily complex.
C. It ensures the VC dimension is infinite.
D. It removes the need for a loss function.
Correct Answer: It suggests choosing a space that is likely to contain the target function but is not unnecessarily complex.
Explanation: Prior knowledge guides the selection of H to minimize approximation error while keeping estimation error (complexity) manageable.
45. Validation Sets are used to:
A. Train the model parameters.
B. Estimate the generalization error and tune hyperparameters.
C. Increase the training set size.
D. Calculate the exact VC dimension.
Correct Answer: Estimate the generalization error and tune hyperparameters.
Explanation: Validation data is held out from training to act as a proxy for test data, helping to select the best model or hyperparameters.
46. In the context of the 'Curse of Dimensionality', as the number of features increases, the amount of data needed to generalize accurately:
A. Increases linearly.
B. Increases exponentially.
C. Decreases.
D. Remains constant.
Correct Answer: Increases exponentially.
Explanation: The volume of the space increases exponentially with dimension, making data sparse. To maintain density/coverage, exponentially more data is required.
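A back-of-the-envelope sketch of the exponential growth (my own illustration): if each feature axis is divided into 10 bins, uniformly covering the feature space needs 10^d cells, and comparably many samples to put data in each cell.

```python
# With 10 bins per feature axis, covering the d-dimensional feature space
# takes 10**d cells, so the data needed for uniform coverage grows
# exponentially with the number of features.
cells_per_axis = 10
cells_needed = {d: cells_per_axis ** d for d in (1, 2, 3, 5, 10)}
print(cells_needed)   # {1: 10, 2: 100, 3: 1000, 5: 100000, 10: 10000000000}
```

Going from 3 features to 10 multiplies the coverage requirement by ten million, which is why high-dimensional learners must lean on strong inductive biases rather than raw coverage.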
47. Which strategy helps when the Sample Complexity is too high for the available data?
A. Increase the complexity of the model.
B. Use a stronger inductive bias (simpler model).
C. Decrease ε and δ.
D. Discard training data.
Correct Answer: Use a stronger inductive bias (simpler model).
Explanation: Using a simpler model (lower VC dim) reduces sample complexity, though it risks increasing approximation error (bias).
48. The Bayes Optimal Classifier represents:
A. The worst possible classifier.
B. The classifier with the minimum possible theoretical error rate.
C. A classifier that assumes all features are dependent.
D. A linear classifier.
Correct Answer: The classifier with the minimum possible theoretical error rate.
Explanation: The Bayes Optimal Classifier assigns the most probable class based on the true underlying probability distribution. No classifier can have a lower error rate on average.
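A toy sketch with a fully known discrete distribution (the probability values are hypothetical, chosen for illustration): the Bayes optimal classifier predicts the most probable class given x, and its error rate, the Bayes error, is the irreducible minimum for this distribution.

```python
# Hypothetical known distribution over three feature values:
p_y1_given_x = {"a": 0.9, "b": 0.4, "c": 0.2}   # P(y=1 | x)
p_x = {"a": 0.5, "b": 0.3, "c": 0.2}            # P(x)

def bayes_classifier(x):
    # Predict the most probable class given x (argmax_y P(y|x)).
    return 1 if p_y1_given_x[x] >= 0.5 else 0

# Bayes error: for each x the classifier is wrong with probability min(p, 1-p).
bayes_error = sum(p_x[x] * min(p, 1 - p) for x, p in p_y1_given_x.items())
print(bayes_error)   # 0.05 + 0.12 + 0.04 = 0.21 (up to float rounding)
```

Any other decision rule disagrees with the argmax for some x and therefore incurs a strictly larger error on that x, which is the sense in which no classifier can do better on average.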
49. An algorithm is Efficiently PAC-Learnable if:
A. It runs in polynomial time with respect to 1/ε, 1/δ, and size(c).
B. It runs in exponential time.
C. It requires zero samples.
D. It guarantees 100% accuracy.
Correct Answer: It runs in polynomial time with respect to 1/ε, 1/δ, and size(c).
Explanation: Efficiency in PAC learning refers to computational complexity. The algorithm must produce the hypothesis using resources bounded polynomially by the problem parameters.
50. When choosing an algorithm based on data assumptions, if you assume the data is linearly separable with a large margin, which algorithm is theoretically most appropriate?
A. 1-Nearest Neighbor
B. Support Vector Machine (SVM)
C. Decision Stump
D. Naive Bayes
Correct Answer: Support Vector Machine (SVM)
Explanation: SVMs are designed to maximize the margin between classes, making them the theoretical ideal for data assumed to have a large margin linear separation.