1. According to Tom Mitchell's definition, a computer program is said to learn from experience E with respect to some class of tasks T and performance measure P if:
A. Its performance at tasks in T, as measured by P, improves with experience E.
B. It can memorize the experience perfectly without error.
C. Its performance at tasks in T remains constant regardless of E.
D. It requires no prior knowledge to solve tasks in T.
Correct Answer: Its performance at tasks in T, as measured by P, improves with experience E.
Explanation: This is the formal definition of a well-posed learning problem provided by Tom Mitchell. Learning is characterized by performance improvement on specific tasks based on experience.
2. In the context of a 'Checkers Learning Problem', what represents the Task (T)?
A. The percent of games won against opponents.
B. Playing checkers games.
C. Playing practice games against itself.
D. The rules of the game.
Correct Answer: Playing checkers games.
Explanation: T is the task the system is performing (playing checkers), P is the performance measure (percent of games won), and E is the experience (practice games).
3. Which component of a learning system represents the set of all possible functions that the learning algorithm can select as the learned function?
A. The training set
B. The target function
C. The hypothesis space
D. The feature vector
Correct Answer: The hypothesis space
Explanation: The hypothesis space (H) is the set of all legal hypotheses (functions) that the algorithm can explore and select from to approximate the target function.
4. In the Statistical Learning Framework, the data generation process assumes that data pairs (x, y) are generated independently and identically distributed (i.i.d.) according to:
A. A known Gaussian distribution.
B. A fixed but unknown probability distribution D.
C. A uniform distribution over integers.
D. The user's manual input.
Correct Answer: A fixed but unknown probability distribution D.
Explanation: The standard statistical learning framework assumes there is an underlying, fixed, but unknown joint probability distribution D (or P(x, y)) from which data is sampled.
5. What is the primary goal of the Empirical Risk Minimization (ERM) principle?
A. To minimize the error on the unseen test data directly.
B. To minimize the average loss on the observed training data.
C. To maximize the size of the hypothesis space.
D. To minimize the computational time of the algorithm.
Correct Answer: To minimize the average loss on the observed training data.
Explanation: ERM seeks to find a hypothesis that minimizes the empirical risk (average loss) calculated over the given training sample S, as a proxy for the true risk.
6. The True Risk (or Generalization Error) is defined as:
A. The average error on the training set.
B. The expectation of the loss function over the true distribution D.
C. The difference between training error and validation error.
D. The square root of the bias.
Correct Answer: The expectation of the loss function over the true distribution D.
Explanation: Mathematically, R(h) = E_{(x,y)~D}[L(h(x), y)]. It represents the expected error on future unseen data drawn from the same distribution.
7. Given a loss function L, the Empirical Risk R_emp(h) for a dataset S = {(x_1, y_1), ..., (x_n, y_n)} is given by:
A. sum_{i=1}^{n} L(h(x_i), y_i)
B. (1/n) sum_{i=1}^{n} L(h(x_i), y_i)
C. max_i L(h(x_i), y_i)
D. E_{(x,y)~D}[L(h(x), y)]
Correct Answer: (1/n) sum_{i=1}^{n} L(h(x_i), y_i)
Explanation: Empirical risk is the arithmetic mean of the loss calculated over the specific finite training set of size n.
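The two preceding questions can be made concrete with a short sketch (not from the source; the helper names `empirical_risk` and `erm` are illustrative): compute the average loss of each candidate hypothesis on the training set and let ERM pick the minimizer.

```python
def squared_loss(y_pred, y_true):
    return (y_pred - y_true) ** 2

def empirical_risk(h, data, loss=squared_loss):
    # Arithmetic mean of the loss over the finite training set of size n.
    return sum(loss(h(x), y) for x, y in data) / len(data)

def erm(hypotheses, data):
    # Empirical Risk Minimization: choose the hypothesis with lowest R_emp.
    return min(hypotheses, key=lambda h: empirical_risk(h, data))

data = [(0.0, 0.1), (1.0, 2.1), (2.0, 3.9)]              # roughly y = 2x
candidates = [lambda x: x, lambda x: 2 * x, lambda x: 3 * x]
best = erm(candidates, data)
print(empirical_risk(best, data))   # the 2x hypothesis has the lowest risk
```

The empirical risk of the chosen hypothesis is only a proxy for the true risk; the later questions on generalization bounds quantify how far apart the two can be.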
8. In PAC Learning, what does the parameter ε (epsilon) represent?
A. The probability that the hypothesis is incorrect.
B. The maximum allowed error (accuracy parameter).
C. The size of the training dataset.
D. The complexity of the hypothesis space.
Correct Answer: The maximum allowed error (accuracy parameter).
Explanation: In Probably Approximately Correct (PAC) learning, we want the error of the hypothesis to be at most ε. It defines the 'Approximately' part of PAC.
9. In PAC Learning, what does the parameter δ (delta) represent?
A. The error rate of the classifier.
B. The probability that the learning algorithm fails to output a good hypothesis.
C. The learning rate of the gradient descent.
D. The dimensionality of the input space.
Correct Answer: The probability that the learning algorithm fails to output a good hypothesis.
Explanation: δ is the confidence parameter. We want the probability of producing a bad hypothesis (error > ε) to be at most δ. This is the 'Probably' part of PAC.
10. A learning algorithm is considered Consistent if:
A. It always produces the same hypothesis for different datasets.
B. It produces a hypothesis that makes zero errors on the training examples.
C. It has zero variance.
D. It does not require inductive bias.
Correct Answer: It produces a hypothesis that makes zero errors on the training examples.
Explanation: Consistency in this context means the hypothesis fits the training data perfectly (assuming the target function is in the hypothesis space).
11. Inductive Bias is necessary in machine learning because:
A. It speeds up the hardware processing.
B. Without it, a learner cannot generalize beyond the observed training examples.
C. It eliminates the need for training data.
D. It ensures the target function is always linear.
Correct Answer: Without it, a learner cannot generalize beyond the observed training examples.
Explanation: Without inductive bias, a learner can only memorize the training data. To predict unseen data, the learner must make assumptions (bias) about the structure of the target function.
12. Which of the following is an example of a Restriction Bias (Language Bias)?
A. Preferring the shortest decision tree.
B. Limiting the hypothesis space to linear separators.
C. Using Gradient Descent to find weights.
D. Preferring hypotheses with larger margins.
Correct Answer: Limiting the hypothesis space to linear separators.
Explanation: Restriction bias strictly limits the set of hypotheses considered (e.g., only linear models). Preference bias defines a preference ordering within the space (e.g., simpler trees).
13. Occam's Razor serves as a basis for which type of inductive bias?
A. Restriction Bias
B. Preference Bias
C. Sampling Bias
D. Confirmation Bias
Correct Answer: Preference Bias
Explanation: Occam's Razor suggests preferring the simplest hypothesis that fits the data. This is a preference (or search) bias, not a hard restriction on what is possible.
14. The No Free Lunch Theorem essentially states that:
A. Deep learning is always superior to other methods.
B. If averaged over all possible data generating distributions, every classification algorithm has the same error rate.
C. More data always leads to better performance regardless of the algorithm.
D. Computational cost is the only constraint in learning.
Correct Answer: If averaged over all possible data generating distributions, every classification algorithm has the same error rate.
Explanation: The NFL theorem asserts that no single learning algorithm is universally superior. Superiority is only possible given specific assumptions about the problem domain.
15. What is Sample Complexity?
A. The time complexity required to process a sample.
B. The number of training examples required to learn a target function to within error ε with probability 1 - δ.
C. The complexity of the mathematical function used to generate samples.
D. The number of features in the dataset.
Correct Answer: The number of training examples required to learn a target function to within error ε with probability 1 - δ.
Explanation: Sample complexity quantifies the amount of data needed to achieve a specific level of generalization performance in the PAC framework.
16. For a finite hypothesis space H, the sample complexity bound for a consistent learner is roughly proportional to:
A. |H|
B. ln |H|
C. |H|^2
D. 2^|H|
Correct Answer: ln |H|
Explanation: The bound is generally m >= (1/ε)(ln |H| + ln(1/δ)), making it logarithmic with respect to the size of the hypothesis space.
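As an illustrative sketch (the helper name `sample_complexity` and the numbers are my own), the bound in the explanation can be evaluated directly, which also shows how m moves when ε or δ changes:

```python
import math

def sample_complexity(H_size, eps, delta):
    # m >= (1/eps) * (ln|H| + ln(1/delta)) for a consistent learner
    # over a finite hypothesis space (realizable PAC setting).
    return math.ceil((math.log(H_size) + math.log(1 / delta)) / eps)

m = sample_complexity(H_size=10**6, eps=0.1, delta=0.05)
print(m)
# Tightening the accuracy requirement (smaller eps) raises m sharply;
# relaxing the confidence requirement (larger delta) lowers it.
print(sample_complexity(10**6, 0.01, 0.05))   # smaller eps  -> more samples
print(sample_complexity(10**6, 0.1, 0.5))     # larger delta -> fewer samples
```

Note the logarithmic dependence on |H|: squaring the size of the hypothesis space only doubles its contribution to m.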
17. If a learning problem is Agnostic, it means:
A. The target function is guaranteed to be in the hypothesis space H.
B. We do not assume the target function is contained within the hypothesis space H.
C. The learner ignores the training data.
D. The labels are missing from the training set.
Correct Answer: We do not assume the target function is contained within the hypothesis space H.
Explanation: Agnostic learning (or the unrealizable setting) assumes the true target concept might not be representable by our model class, so we look for the best approximation.
18. What does the VC (Vapnik-Chervonenkis) Dimension measure?
A. The number of parameters in a model.
B. The runtime of the learning algorithm.
C. The capacity or complexity of a hypothesis space.
D. The size of the dataset.
Correct Answer: The capacity or complexity of a hypothesis space.
Explanation: VC Dimension is a measure of the capacity of a statistical classification algorithm, defined as the cardinality of the largest set of points that the algorithm can shatter.
19. A set of points S is shattered by a hypothesis space H if:
A. The points are linearly separable.
B. For every possible labeling of the points in S, there exists a hypothesis in H that classifies them correctly.
C. The points cannot be classified correctly by any h in H.
D. The points are drawn from a uniform distribution.
Correct Answer: For every possible labeling of the points in S, there exists a hypothesis in H that classifies them correctly.
Explanation: Shattering means the hypothesis space is expressive enough to separate the points regardless of how they are labeled (binary classification).
20. What is the VC dimension of a linear classifier (perceptron) in 2-dimensional space (R^2)?
A. 2
B. 3
C. 4
D. Infinite
Correct Answer: 3
Explanation: In d dimensions, the VC dimension of a linear separator is d + 1. For d = 2, it is 3.
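A brute-force sketch (my own construction, not from the source) can illustrate the d + 1 result: search a small grid of linear classifiers for separators of every labeling of 3 points, and of the XOR labeling of 4 points. Finding a separator proves shatterability; failing to find one on a finite grid is only suggestive, though for the XOR labeling non-separability is easy to prove analytically.

```python
from itertools import product

def separable(points, labels, grid):
    # True if some linear rule sign(w1*x1 + w2*x2 + b) matches all labels.
    for w1, w2, b in product(grid, grid, grid):
        if all((w1 * x + w2 * y + b > 0) == (lab == 1)
               for (x, y), lab in zip(points, labels)):
            return True
    return False

grid = [v / 2 for v in range(-6, 7)]           # weights -3.0, -2.5, ..., 3.0
tri = [(0, 0), (1, 0), (0, 1)]                  # 3 points: all 8 labelings work
shatter3 = all(separable(tri, labs, grid)
               for labs in product([1, -1], repeat=3))
quad = [(0, 0), (1, 1), (1, 0), (0, 1)]         # XOR labeling of 4 points
xor_ok = separable(quad, [1, 1, -1, -1], grid)
print(shatter3, xor_ok)   # True False, consistent with VC dim d + 1 = 3 in 2D
```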
21. Which of the following implies that a hypothesis space has infinite VC dimension?
A. It can shatter a dataset of size 3.
B. It can shatter datasets of arbitrarily large size.
C. It contains only linear functions.
D. It uses a Euclidean distance metric.
Correct Answer: It can shatter datasets of arbitrarily large size.
Explanation: If for any integer n, there exists a set of size n that can be shattered, the VC dimension is infinite (e.g., 1-Nearest Neighbor or sine waves).
22. In the context of Model Selection, Structural Risk Minimization (SRM) aims to:
A. Minimize empirical risk only.
B. Balance empirical risk and the complexity of the hypothesis space.
C. Maximize the complexity of the hypothesis space.
D. Minimize the training time.
Correct Answer: Balance empirical risk and the complexity of the hypothesis space.
Explanation: SRM adds a penalty term for model complexity to the empirical risk, effectively trading off between fitting the data well and keeping the model simple (regularization).
23. Which loss function is commonly used for regression problems in the statistical learning framework?
A. Zero-One Loss
B. Squared Error Loss
C. Hinge Loss
D. Cross-Entropy Loss
Correct Answer: Squared Error Loss
Explanation: Squared error loss, L(h(x), y) = (h(x) - y)^2, is the standard loss function for continuous regression problems.
24. What is the consequence of having a hypothesis space with a VC dimension significantly higher than the number of training examples?
A. Underfitting
B. Overfitting
C. Convergence to the optimal solution.
D. Zero computational cost.
Correct Answer: Overfitting
Explanation: If the model capacity (VC dim) is much larger than the number of samples, the model can memorize noise, leading to low training error but high generalization error (overfitting).
25. The inductive bias of the k-Nearest Neighbor (k-NN) algorithm is:
A. The decision boundary is linear.
B. The target function is a decision tree.
C. Points close to each other in feature space likely have the same label.
D. The features are statistically independent.
Correct Answer: Points close to each other in feature space likely have the same label.
Explanation: k-NN relies on the assumption of smoothness or locality: similar inputs yield similar outputs.
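A minimal 1-NN sketch (the function `knn_predict` is illustrative, not from the source) makes the locality bias explicit: the prediction at a query point is copied from its nearest training points.

```python
import math

def knn_predict(train, query, k=1):
    # k-NN encodes the locality bias: the label of a query point is taken
    # from its nearest neighbours in feature space.
    neighbours = sorted(train, key=lambda p: math.dist(p[0], query))[:k]
    labels = [lab for _, lab in neighbours]
    return max(set(labels), key=labels.count)

train = [((0.0, 0.0), "red"), ((0.2, 0.1), "red"), ((5.0, 5.0), "blue")]
print(knn_predict(train, (0.1, 0.1)))   # "red": nearby points share a label
print(knn_predict(train, (4.9, 5.1)))   # "blue"
```

If the locality assumption is violated (nearby points often have different labels), this bias hurts rather than helps, which is exactly the point of the earlier No Free Lunch question.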
26. Which of the following statements about Prior Knowledge is TRUE?
A. It is useless in the era of Big Data.
B. It can reduce the sample complexity of a learning task.
C. It increases the likelihood of overfitting.
D. It is strictly forbidden in unsupervised learning.
Correct Answer: It can reduce the sample complexity of a learning task.
Explanation: Incorporating prior knowledge (e.g., via constraints or specific hypothesis spaces) reduces the effective search space, allowing the model to learn from fewer examples.
27. In the inequality |R(h) - R_emp(h)| <= ε, what is this bound attempting to quantify?
A. The accuracy of the training data labels.
B. The generalization gap.
C. The computational speed.
D. The precision of the floating-point calculations.
Correct Answer: The generalization gap.
Explanation: This inequality bounds the difference between the observed training error and the true error, ensuring that the empirical performance is a reliable indicator of true performance.
28. What is a Hypothesis in machine learning?
A. A proven theorem.
B. A specific function mapping inputs to outputs selected from the hypothesis space.
C. The raw data collected.
D. The error metric used.
Correct Answer: A specific function mapping inputs to outputs selected from the hypothesis space.
Explanation: A hypothesis is a candidate model or function that attempts to approximate the target function.
29. Which learning scenario involves a 'supervisor' providing correct labels?
A. Unsupervised Learning
B. Supervised Learning
C. Reinforcement Learning
D. Clustering
Correct Answer: Supervised Learning
Explanation: Supervised learning is defined by the presence of labeled training data (input-output pairs).
30. The Hoeffding Inequality provides a bound for:
A. The difference between the true mean and the empirical mean of a random variable.
B. The maximum depth of a decision tree.
C. The optimal number of clusters.
D. The convergence rate of gradient descent.
Correct Answer: The difference between the true mean and the empirical mean of a random variable.
Explanation: Hoeffding's inequality bounds the probability that the sum (or average) of random variables deviates from its expected value. It is crucial for proving PAC bounds.
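A quick simulation (my own sketch, with arbitrary parameter choices) compares the observed deviation frequency of a sample mean against the two-sided Hoeffding bound 2 exp(-2 n ε^2) for variables bounded in [0, 1]:

```python
import math
import random

random.seed(0)
n, eps, p, trials = 100, 0.1, 0.5, 2000
# Hoeffding (two-sided, [0,1]-bounded variables): P(|mean - p| >= eps) <= bound
bound = 2 * math.exp(-2 * n * eps**2)

deviations = 0
for _ in range(trials):
    mean = sum(random.random() < p for _ in range(n)) / n   # Bernoulli(p) sample mean
    if abs(mean - p) >= eps:
        deviations += 1
freq = deviations / trials
print(freq, bound)   # the observed frequency stays well below the bound (~0.27)
```

The bound is distribution-free, so it is loose for any particular distribution; its value is that it holds uniformly, which is what the PAC derivations need.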
31. If a hypothesis space is finite, is it PAC-learnable?
A. No, never.
B. Yes, provided the target concept is in H and we have enough samples.
C. Only if the size of H is less than 10.
D. Only if the VC dimension is infinite.
Correct Answer: Yes, provided the target concept is in H and we have enough samples.
Explanation: Finite hypothesis spaces are PAC-learnable. The sample complexity is polynomial in 1/ε, 1/δ, and ln |H|.
32. The Bias-Variance Tradeoff implies that:
A. We should always minimize bias to zero.
B. Increasing model complexity decreases bias but increases variance.
C. Increasing model complexity increases bias but decreases variance.
D. Bias and variance are independent of model complexity.
Correct Answer: Increasing model complexity decreases bias but increases variance.
Explanation: Complex models fit training data well (low bias) but fluctuate wildly with different datasets (high variance). Simple models are stable (low variance) but may miss patterns (high bias).
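The tradeoff can be observed with a small resampling experiment (my own sketch; the constant-mean and 1-NN predictors stand in for "simple" and "complex" models): over many noisy datasets, the rigid model shows higher bias and lower variance than the flexible one.

```python
import random
import statistics

random.seed(1)
xs = [0.0, 0.5, 1.0, 1.5, 2.0]
f = lambda x: x * x                       # true target function
x0, trials = 1.0, 4000                    # evaluate predictions at x0

simple_preds, complex_preds = [], []
for _ in range(trials):
    ys = [f(x) + random.gauss(0, 0.2) for x in xs]      # fresh noisy dataset
    simple_preds.append(sum(ys) / len(ys))              # constant predictor: rigid, stable
    nearest = min(range(len(xs)), key=lambda i: abs(xs[i] - x0))
    complex_preds.append(ys[nearest])                   # 1-NN predictor: flexible, noisy

bias_simple = abs(statistics.mean(simple_preds) - f(x0))
bias_complex = abs(statistics.mean(complex_preds) - f(x0))
var_simple = statistics.variance(simple_preds)
var_complex = statistics.variance(complex_preds)
# Typically: bias_simple > bias_complex while var_complex > var_simple.
print(bias_simple, bias_complex, var_simple, var_complex)
```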
33. Why is Zero-One Loss difficult to optimize directly?
A. It is always zero.
B. It is non-convex and not differentiable.
C. It requires infinite data.
D. It produces negative values.
Correct Answer: It is non-convex and not differentiable.
Explanation: 0/1 loss is a step function (constant regions with jumps), making gradients zero or undefined, preventing gradient-based optimization.
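A two-line sketch (illustrative, not from the source) shows why: the finite-difference derivative of the 0/1 loss with respect to a weight is zero almost everywhere, so gradient methods receive no signal about which direction improves the classifier.

```python
def zero_one_loss(w, x, y):
    # Predict sign(w*x); loss is 1 on a mistake, 0 otherwise.
    pred = 1 if w * x > 0 else -1
    return 0.0 if pred == y else 1.0

# The 0/1 loss is piecewise constant in w, so away from the decision
# boundary the numerical derivative is exactly zero.
w, x, y, h = 0.5, 1.0, -1, 1e-6
grad = (zero_one_loss(w + h, x, y) - zero_one_loss(w - h, x, y)) / (2 * h)
print(grad)   # 0.0
```

This is why practical learners minimize smooth surrogates (squared, hinge, or cross-entropy loss) instead of the 0/1 loss itself.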
34. Which of the following represents the Approximation Error?
A. The error due to finite training samples (variance).
B. The minimum possible risk achievable by a hypothesis in H versus the true target function.
C. The error due to noise in labels.
D. The calculation error of the CPU.
Correct Answer: The minimum possible risk achievable by a hypothesis in H versus the true target function.
Explanation: Approximation error is the error incurred because the hypothesis space might not contain the true target function. It is a measure of inductive bias limitations.
35. The Fundamental Theorem of Statistical Learning relates PAC learnability to:
A. Neural Network depth.
B. Finite VC Dimension.
C. Gaussian distributions.
D. Unsupervised clustering.
Correct Answer: Finite VC Dimension.
Explanation: The theorem states that a hypothesis class is PAC-learnable if and only if its VC dimension is finite.
36. If an algorithm chooses a hypothesis simply because it works well on training data, but has no theoretical justification for working on unseen data, it lacks:
A. Consistency
B. Generalization guarantees
C. Empirical accuracy
D. Optimization speed
Correct Answer: Generalization guarantees
Explanation: Without theoretical bounds (like PAC or VC), good training performance does not mathematically guarantee good test performance.
37. In the context of the No Free Lunch Theorem, when is Algorithm A better than Algorithm B?
A. Always, if A is a Deep Neural Network.
B. Only with respect to a specific distribution or class of problems.
C. If A has more parameters than B.
D. If A runs faster than B.
Correct Answer: Only with respect to a specific distribution or class of problems.
Explanation: NFL states that performance is tied to specific problem domains. One algorithm outperforms another only if its inductive bias matches the specific problem.
38. What is the Estimation Error?
A. The error caused by selecting a specific hypothesis from H using finite data, instead of the best possible hypothesis in H.
B. The error caused by the hypothesis space not containing the target.
C. The inherent noise in the system.
D. The error in measuring features.
Correct Answer: The error caused by selecting a specific hypothesis from H using finite data, instead of the best possible hypothesis in H.
Explanation: Estimation error arises because we have limited training data, so we might pick a sub-optimal hypothesis from H (variance).
39. Which inequality is primarily used to derive the sample complexity bound m >= (1/ε)(ln |H| + ln(1/δ))?
A. Cauchy-Schwarz Inequality
B. Union Bound
C. Triangle Inequality
D. Jensen's Inequality
Correct Answer: Union Bound
Explanation: The derivation typically sums the probabilities of bad hypotheses using the Union Bound, combined with an exponential bound like Hoeffding's.
40. If we increase the confidence parameter δ (e.g., from 0.01 to 0.1), the required sample size:
A. Increases
B. Decreases
C. Stays the same
D. Becomes infinite
Correct Answer: Decreases
Explanation: Higher δ means we are accepting a higher probability of failure (lower confidence). Therefore, we need fewer samples.
41. If we decrease the error parameter ε (e.g., from 0.1 to 0.01), the required sample size:
A. Increases
B. Decreases
C. Stays the same
D. Becomes zero
Correct Answer: Increases
Explanation: Lower ε means we demand higher accuracy. This requires significantly more training data.
42. A hypothesis space consisting of all possible axis-aligned rectangles in 2D has a VC dimension of:
A. 2
B. 3
C. 4
D. 5
Correct Answer: 4
Explanation: You can shatter 4 points (in a diamond shape) with axis-aligned rectangles, but not 5.
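The shattering claim can be checked by brute force (my own sketch, not from the source): enumerate axis-aligned rectangles with bounds drawn from a small set of cut points and verify that every one of the 16 labelings of a diamond-shaped set of 4 points is realized.

```python
from itertools import product

def rect_consistent(points, labels, cuts):
    # True if some axis-aligned rectangle contains exactly the points
    # labelled +1. Bounds drawn from a finite set of cuts, which suffices
    # for these integer-coordinate points.
    for xlo, xhi in product(cuts, cuts):
        for ylo, yhi in product(cuts, cuts):
            if xlo >= xhi or ylo >= yhi:
                continue
            if all((xlo <= x <= xhi and ylo <= y <= yhi) == (lab == 1)
                   for (x, y), lab in zip(points, labels)):
                return True
    return False

diamond = [(0, 1), (1, 0), (1, 2), (2, 1)]     # 4 points in a diamond shape
cuts = [-0.5, 0.5, 1.5, 2.5]
shattered = all(rect_consistent(diamond, labs, cuts)
                for labs in product([1, -1], repeat=4))
print(shattered)   # True: axis-aligned rectangles shatter these 4 points
```

For any 5 points, the rectangle spanning the extreme points in each direction must contain the fifth, so the "leave out the inner point" labeling is unrealizable, which is why the VC dimension stops at 4.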
43. Which of the following is an assumption of the PAC framework?
A. The training and testing data are drawn from the same distribution.
B. The distribution changes over time.
C. The learner knows the distribution beforehand.
D. The noise level is exactly zero.
Correct Answer: The training and testing data are drawn from the same distribution.
Explanation: Stationarity (same fixed distribution for train and test) is a core assumption for standard PAC bounds to hold.
44. What role does Prior Knowledge play in the choice of a hypothesis space?
A. It allows the use of a larger, more complex space.
B. It suggests choosing a space that is likely to contain the target function but is not unnecessarily complex.
C. It ensures the VC dimension is infinite.
D. It removes the need for a loss function.
Correct Answer: It suggests choosing a space that is likely to contain the target function but is not unnecessarily complex.
Explanation: Prior knowledge guides the selection of H to minimize approximation error while keeping estimation error (complexity) manageable.
45. Validation Sets are used to:
A. Train the model parameters.
B. Estimate the generalization error and tune hyperparameters.
C. Increase the training set size.
D. Calculate the exact VC dimension.
Correct Answer: Estimate the generalization error and tune hyperparameters.
Explanation: Validation data is held out from training to act as a proxy for test data, helping to select the best model or hyperparameters.
46. In the context of the 'Curse of Dimensionality', as the number of features increases, the amount of data needed to generalize accurately:
A. Increases linearly.
B. Increases exponentially.
C. Decreases.
D. Remains constant.
Correct Answer: Increases exponentially.
Explanation: The volume of the space increases exponentially with dimension, making data sparse. To maintain density/coverage, exponentially more data is required.
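A back-of-the-envelope sketch of the exponential growth (my own illustration): if each feature axis is divided into 10 bins, uniformly covering the feature space needs 10^d cells, and comparably many samples to put data in each cell.

```python
# With 10 bins per feature axis, covering the d-dimensional feature space
# takes 10**d cells, so the data needed for uniform coverage grows
# exponentially with the number of features.
cells_per_axis = 10
cells_needed = {d: cells_per_axis ** d for d in (1, 2, 3, 5, 10)}
print(cells_needed)   # {1: 10, 2: 100, 3: 1000, 5: 100000, 10: 10000000000}
```

Going from 3 features to 10 multiplies the coverage requirement by ten million, which is why high-dimensional learners must lean on strong inductive biases rather than raw coverage.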
47. Which strategy helps when the Sample Complexity is too high for the available data?
A. Increase the complexity of the model.
B. Use a stronger inductive bias (simpler model).
C. Decrease ε and δ.
D. Discard training data.
Correct Answer: Use a stronger inductive bias (simpler model).
Explanation: Using a simpler model (lower VC dim) reduces sample complexity, though it risks increasing approximation error (bias).
48. The Bayes Optimal Classifier represents:
A. The worst possible classifier.
B. The classifier with the minimum possible theoretical error rate.
C. A classifier that assumes all features are dependent.
D. A linear classifier.
Correct Answer: The classifier with the minimum possible theoretical error rate.
Explanation: The Bayes Optimal Classifier assigns the most probable class based on the true underlying probability distribution. No classifier can have a lower error rate on average.
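A toy sketch with a fully known discrete distribution (the probability values are hypothetical, chosen for illustration): the Bayes optimal classifier predicts the most probable class given x, and its error rate, the Bayes error, is the irreducible minimum for this distribution.

```python
# Hypothetical known distribution over three feature values:
p_y1_given_x = {"a": 0.9, "b": 0.4, "c": 0.2}   # P(y=1 | x)
p_x = {"a": 0.5, "b": 0.3, "c": 0.2}            # P(x)

def bayes_classifier(x):
    # Predict the most probable class given x (argmax_y P(y|x)).
    return 1 if p_y1_given_x[x] >= 0.5 else 0

# Bayes error: for each x the classifier is wrong with probability min(p, 1-p).
bayes_error = sum(p_x[x] * min(p, 1 - p) for x, p in p_y1_given_x.items())
print(bayes_error)   # 0.05 + 0.12 + 0.04 = 0.21 (up to float rounding)
```

Any other decision rule disagrees with the argmax for some x and therefore incurs a strictly larger error on that x, which is the sense in which no classifier can do better on average.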
49. An algorithm is Efficiently PAC-Learnable if:
A. It runs in polynomial time with respect to 1/ε, 1/δ, and size(c).
B. It runs in exponential time.
C. It requires zero samples.
D. It guarantees 100% accuracy.
Correct Answer: It runs in polynomial time with respect to 1/ε, 1/δ, and size(c).
Explanation: Efficiency in PAC learning refers to computational complexity. The algorithm must produce the hypothesis using resources bounded polynomially by the problem parameters.
50. When choosing an algorithm based on data assumptions, if you assume the data is linearly separable with a large margin, which algorithm is theoretically most appropriate?
A. 1-Nearest Neighbor
B. Support Vector Machine (SVM)
C. Decision Stump
D. Naive Bayes
Correct Answer: Support Vector Machine (SVM)
Explanation: SVMs are designed to maximize the margin between classes, making them the theoretical ideal for data assumed to have a large margin linear separation.