CSE273

Unit 6: Statistical Learning Theory

1. Well-Posed Learning Problems

A computer program is said to learn from experience $E$ with respect to some class of tasks $T$ and performance measure $P$ , if its performance at tasks in $T$ , as measured by $P$ , improves with experience $E$ .

Components of the Definition (The T-P-E Framework)

Task ( $T$ ): The specific behavior or function the system must perform (e.g., recognizing faces, playing chess, filtering spam).
Performance Measure ( $P$ ): A quantitative metric used to evaluate how well the system performs the task (e.g., accuracy, mean squared error, win rate).
Experience ( $E$ ): The data or interactions the system uses to learn (e.g., labeled image datasets, history of past games).

Example: Spam Filtering

Task ( $T$ ): Classify emails as "Spam" or "Not Spam."
Performance ( $P$ ): Percentage of emails correctly classified.
Experience ( $E$ ): A database of emails labeled by humans.

2. Components of a Learning System

Designing a learning system requires distinct functional modules that interact to convert raw data into a refined hypothesis.

Key Functional Components

The Critic (Evaluator): Compares the output of the learner against the ground truth or a performance standard. It produces an error signal or reward.
The Learner (Performance Element): The core algorithm that estimates the target function. It takes feedback from the critic to adjust internal parameters.
The Generalizer (Hypothesis Generator): Takes specific training examples and outputs a hypothesis that covers unseen cases.
The Experiment Generator (used in active learning): Decides which new example the system should investigate next to maximize learning.

[Figure: block diagram of the functional design of a machine learning system.]

3. Statistical Learning Framework

Statistical Learning Theory (SLT) provides the mathematical foundation for analyzing machine learning algorithms. It formalizes the problem of inference from data.

Formal Definitions

Input Space ( $\mathcal{X}$ ): The set of all possible inputs (feature vectors).
Output Space ( $\mathcal{Y}$ ): The set of all possible outputs (labels).
Unknown Distribution ( $\mathcal{D}$ ): A fixed but unknown probability distribution over $\mathcal{X} \times \mathcal{Y}$ . We assume the data are drawn i.i.d. (independent and identically distributed) from $\mathcal{D}$ .
Target Function ( $f$ ): The ideal function $f: \mathcal{X} \rightarrow \mathcal{Y}$ that we want to approximate.
Hypothesis Space ( $\mathcal{H}$ ): The set of all functions the algorithm can select from (e.g., the set of all linear classifiers).
Loss Function ( $\ell$ ): A function $\ell(y, \hat{y})$ that measures the penalty for predicting $\hat{y}$ when the true label is $y$ .

The Goal

The goal is to find a hypothesis $h \in \mathcal{H}$ that minimizes the Risk (Expected Error) over the unknown distribution:
$R(h) = \mathbb{E}_{(x,y) \sim \mathcal{D}} [\ell(y, h(x))]$
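As a minimal sketch of what this expectation means, suppose (unrealistically) that $\mathcal{D}$ were known and finite. The risk is then just a probability-weighted sum of losses. The toy distribution, the 0-1 loss, and the names `true_risk`, `h_identity`, and `h_always_zero` below are assumptions made for illustration only.

```python
# Toy joint distribution D over (x, y) pairs; the probabilities are invented
# for illustration and sum to 1.
D = {
    (0, 0): 0.4,
    (0, 1): 0.1,
    (1, 0): 0.2,
    (1, 1): 0.3,
}


def zero_one_loss(y, y_hat):
    """Penalty 1 for a wrong prediction, 0 for a correct one."""
    return 0 if y == y_hat else 1


def true_risk(h, dist, loss=zero_one_loss):
    """R(h) = E_{(x,y)~D}[loss(y, h(x))] for a finite, known distribution."""
    return sum(p * loss(y, h(x)) for (x, y), p in dist.items())


def h_identity(x):
    return x  # predicts y = x


def h_always_zero(x):
    return 0  # ignores the input


risk_identity = true_risk(h_identity, D)   # wrong on (0,1) and (1,0)
risk_zero = true_risk(h_always_zero, D)    # wrong whenever y = 1
print(risk_identity, risk_zero)
```

Under these made-up numbers, `h_identity` has risk of about 0.3 and `h_always_zero` about 0.4, so the first hypothesis would be preferred if $\mathcal{D}$ could actually be inspected.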

4. Empirical Risk Minimization (ERM)

Since we do not know the distribution $\mathcal{D}$ , we cannot calculate the true risk $R(h)$ . Instead, we rely on the training data.

Empirical Risk

Given a training set $S = \{(x_1, y_1), ..., (x_m, y_m)\}$ , the empirical risk is the average error on this specific sample:
$\hat{R}_S(h) = \frac{1}{m} \sum_{i=1}^{m} \ell(y_i, h(x_i))$

The ERM Principle

The principle of ERM states that the learning algorithm should choose the hypothesis $\hat{h}$ that minimizes the empirical risk:
$\hat{h} = \arg\min_{h \in \mathcal{H}} \hat{R}_S(h)$
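The two formulas above can be sketched for a small finite hypothesis space. This is an illustrative example, not a general-purpose implementation: the sample `S`, the family of threshold classifiers, and the helper names are assumptions chosen to keep the arithmetic easy to follow.

```python
def zero_one_loss(y, y_hat):
    return 0 if y == y_hat else 1


def empirical_risk(h, sample, loss=zero_one_loss):
    """Average loss of hypothesis h on the given sample."""
    return sum(loss(y, h(x)) for x, y in sample) / len(sample)


def make_threshold(t):
    """Hypothesis h_t(x) = 1 if x >= t else 0."""
    return lambda x: 1 if x >= t else 0


def erm(hypotheses, sample):
    """Return the (name, h) pair with the smallest empirical risk.

    Ties are broken in favour of the first candidate encountered.
    """
    return min(hypotheses, key=lambda item: empirical_risk(item[1], sample))


# Training set S of (x, y) pairs; the last point carries a noisy label.
S = [(0.5, 0), (1.0, 0), (1.5, 0), (2.5, 1), (3.0, 1), (3.5, 0)]

# Finite hypothesis space H: five threshold classifiers.
H = [(t, make_threshold(t)) for t in (0.0, 1.0, 2.0, 3.0, 4.0)]

best_t, best_h = erm(H, S)
print(best_t, empirical_risk(best_h, S))
```

On this sample the threshold at 2.0 misclassifies only the noisy point, giving an empirical risk of 1/6, which is the minimum over the five candidates.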

Overfitting and Generalization

Minimizing empirical risk blindly can lead to overfitting. A model might memorize the noise in the training set ( $\hat{R}_S(h) \approx 0$ ) yet fail on new data ( $R(h)$ is high). The difference $R(h) - \hat{R}_S(h)$ is the Generalization Gap.

To control this, we often use Structural Risk Minimization (SRM), which trades off empirical risk against a penalty on hypothesis complexity. Adding a regularization term to the training objective is a common practical form of this idea.
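A rough sketch of the contrast between plain ERM and a complexity-penalized choice follows. Everything specific here is an assumption for illustration: the data, the crude complexity measure (number of stored values), the penalty weight `lam`, and the use of a small held-out set as a stand-in for the true risk.

```python
def zero_one_loss(y, y_hat):
    return 0 if y == y_hat else 1


def empirical_risk(h, sample):
    return sum(zero_one_loss(y, h(x)) for x, y in sample) / len(sample)


# The last training label is noise; the held-out set follows the clean rule.
train = [(0.5, 0), (1.0, 0), (1.5, 0), (2.5, 1), (3.0, 1), (3.5, 0)]
held_out = [(0.7, 0), (1.2, 0), (2.2, 1), (2.8, 1), (3.3, 1), (3.6, 1)]

memory = dict(train)


def memorizer(x):
    """Rote learner: returns the stored label, or 0 for unseen inputs."""
    return memory.get(x, 0)


def threshold(x):
    """Simple rule with a single parameter."""
    return 1 if x >= 2.0 else 0


# (hypothesis, complexity); complexity = number of stored values (assumed measure).
candidates = {
    "memorizer": (memorizer, len(train)),
    "threshold": (threshold, 1),
}
lam = 0.05  # penalty weight, chosen arbitrarily for this example


def penalized_risk(name):
    h, complexity = candidates[name]
    return empirical_risk(h, train) + lam * complexity


erm_choice = min(candidates, key=lambda n: empirical_risk(candidates[n][0], train))
srm_choice = min(candidates, key=penalized_risk)

# Held-out risk minus training risk, as an estimate of the generalization gap.
gap = {
    name: empirical_risk(h, held_out) - empirical_risk(h, train)
    for name, (h, _) in candidates.items()
}
print(erm_choice, srm_choice, gap)
```

Plain ERM picks the memorizer (zero training error), while the penalized objective picks the threshold rule. On the held-out points the memorizer's estimated gap is about 0.67, whereas the threshold rule does no worse than it did in training.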

[Figure: line graph of overfitting and the generalization gap.]

5. Inductive Bias

Inductive bias refers to the set of assumptions that the learner uses to predict outputs for inputs that it has not encountered.

Why is it Necessary?

Without inductive bias, a learner cannot generalize. If a learner makes no assumptions, it can only categorize data it has already seen (rote learning). This is often summarized as: "Bias is required for generalization."

Types of Inductive Bias

Restriction Bias (Language Bias):
- Limits the set of hypotheses considered ( $\mathcal{H}$ ).
- Example: Assuming the decision boundary is linear (ignoring all non-linear possibilities).
Preference Bias (Search Bias):
- Determines how the algorithm searches through $\mathcal{H}$ . It prefers certain hypotheses over others within the same space.
- Example: Occam's Razor (preferring simpler trees in Decision Tree learning) or Gradient Descent (preferring a local minimum near its starting point).

6. Role of Prior Knowledge

Prior knowledge complements data. In statistical learning, prior knowledge is injected into the system via:

Choice of Hypothesis Space: We choose Convolutional Neural Networks for images because we believe spatial locality and hierarchy matter.
Regularization: We impose constraints (e.g., L2 regularization) because we assume simpler, smoother functions are more plausible.
Bayesian Priors: Explicitly assigning probabilities to hypotheses before seeing data ( $P(h)$ ).
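To illustrate how a Bayesian prior can change which hypothesis is selected, here is a small sketch with three candidate coin biases. The hypotheses, prior weights, and observed counts are invented for the example; the point is only that the posterior is proportional to likelihood times prior $P(h)$.

```python
# Each hypothesis is a candidate probability of heads.
hypotheses = [0.2, 0.5, 0.8]

# Prior belief P(h): we strongly expect a fair coin (assumed values).
prior = {0.2: 0.1, 0.5: 0.8, 0.8: 0.1}

# Observed data: 3 heads and 1 tail (assumed values).
heads, tails = 3, 1


def likelihood(p, heads, tails):
    """P(data | h) for independent flips with P(heads) = p."""
    return p ** heads * (1 - p) ** tails


# Bayes' rule: posterior is proportional to likelihood * prior.
unnormalized = {p: likelihood(p, heads, tails) * prior[p] for p in hypotheses}
evidence = sum(unnormalized.values())
posterior = {p: w / evidence for p, w in unnormalized.items()}

mle_h = max(hypotheses, key=lambda p: likelihood(p, heads, tails))  # data only
map_h = max(hypotheses, key=lambda p: posterior[p])                 # data + prior
print(mle_h, map_h, posterior)
```

With only four flips, the data alone favour the 0.8 hypothesis, but the prior keeps the fair-coin hypothesis on top. With much more data the likelihood would eventually dominate the prior.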

7. Probably Approximately Correct (PAC) Learning

PAC Learning is a framework proposed by Leslie Valiant to mathematically analyze the feasibility of learning. It answers: Under what conditions is learning possible?

The Definition

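A concept class $C$ is PAC-learnable if there exists an algorithm $A$ such that:
For any $0 < \epsilon < 1$ (error parameter) and $0 < \delta < 1$ (confidence parameter), any target concept $c \in C$ , and any distribution $\mathcal{D}$ over $\mathcal{X}$ , the algorithm produces a hypothesis $h$ such that:
$P(Error(h) \le \epsilon) \ge 1 - \delta$
using a sample size polynomial in $1/\epsilon$ and $1/\delta$ . Here $Error(h)$ is the true risk of $h$ under $\mathcal{D}$ , and the probability is taken over the random draw of the training sample.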

Interpretation

Probably ( $1-\delta$ ): With probability at least $1-\delta$ over the draw of the training sample, the algorithm succeeds (high confidence).
Approximately ( $\epsilon$ ): The returned hypothesis has error at most $\epsilon$ (high accuracy).
We cannot guarantee 0 error or 100% success because the training sample might be unrepresentative (e.g., drawing 100 "heads" in a row from a fair coin).
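The "probably" and "approximately" parts can be checked empirically on a toy problem. The sketch below assumes a threshold concept on $[0, 1]$, a uniform input distribution, and a simple consistent learner. These choices, the parameter values, and the function names are illustrative assumptions rather than part of the PAC definition.

```python
import random


def pac_trial(rng, theta, m):
    """Draw m uniform points on [0, 1), label them with the concept
    c(x) = 1 if x >= theta, and return the true error of the learned rule."""
    xs = [rng.random() for _ in range(m)]
    positives = [x for x in xs if x >= theta]
    # Consistent learner: use the smallest positive example as the threshold
    # (or 1.0 if no positive example was drawn).
    learned = min(positives) if positives else 1.0
    # Under the uniform distribution, the learned rule and the target
    # disagree exactly on the interval [theta, learned).
    return learned - theta


def estimate_success_rate(theta=0.3, m=50, epsilon=0.1, trials=2000, seed=0):
    """Fraction of independent training samples that yield error <= epsilon."""
    rng = random.Random(seed)
    successes = sum(pac_trial(rng, theta, m) <= epsilon for _ in range(trials))
    return successes / trials


epsilon, delta = 0.1, 0.05
success_rate = estimate_success_rate(epsilon=epsilon)

# For this learner, failure means no sample landed in [theta, theta + epsilon],
# which happens with probability (1 - epsilon) ** m.
failure_bound = (1 - epsilon) ** 50
print(success_rate, failure_bound)
```

In this setup an unlucky sample can still produce a poor hypothesis, but with 50 examples the failure probability is about 0.005, well below the chosen $\delta = 0.05$.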

[Figure: Venn diagram visualization of PAC learning.]

8. Sample Complexity

Sample complexity refers to the number of training examples $N$ (the sample size, written $m$ in the ERM section) required to learn a target function to a desired level of accuracy ( $\epsilon$ ) and confidence ( $\delta$ ).

Finite Hypothesis Space

For a finite hypothesis space $\mathcal{H}$ of size $|\mathcal{H}|$ , a consistent learner (one that outputs a hypothesis with zero training error) is PAC provided the number of samples satisfies:
$N \ge \frac{1}{\epsilon} \left( \ln|\mathcal{H}| + \ln\frac{1}{\delta} \right)$

This shows that the required number of samples grows only logarithmically with the size of the hypothesis space and with $1/\delta$ , and linearly with $1/\epsilon$ .
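As a quick worked example, the finite-hypothesis-space bound can be evaluated directly. The function name and the particular values of $|\mathcal{H}|$ , $\epsilon$ , and $\delta$ below are chosen for illustration.

```python
import math


def sample_complexity_finite(h_size, epsilon, delta):
    """Smallest integer N with N >= (1/epsilon) * (ln|H| + ln(1/delta))."""
    bound = (math.log(h_size) + math.log(1.0 / delta)) / epsilon
    return math.ceil(bound)


# 1,000 hypotheses, 10% error tolerance, 95% confidence.
n_small = sample_complexity_finite(1_000, epsilon=0.1, delta=0.05)

# A hypothesis space 1,000 times larger needs far fewer than 1,000x the data.
n_large = sample_complexity_finite(1_000_000, epsilon=0.1, delta=0.05)

print(n_small, n_large)
```

Under these settings the bound asks for 100 examples with a thousand hypotheses and 169 with a million, which illustrates the logarithmic dependence on $|\mathcal{H}|$ .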

Infinite Hypothesis Space (VC Dimension)

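If $\mathcal{H}$ is infinite (e.g., neural networks with real-valued weights), $\ln|\mathcal{H}|$ is unbounded, so we use the Vapnik-Chervonenkis (VC) Dimension instead. The VC dimension is the size of the largest set of points that $\mathcal{H}$ can shatter (label in every possible way); it measures the capacity (complexity) of the hypothesis space. When some hypothesis in $\mathcal{H}$ fits the target exactly, the sample complexity is:
$N = O\left(\frac{1}{\epsilon}\left(VC(\mathcal{H}) \ln\frac{1}{\epsilon} + \ln\frac{1}{\delta}\right)\right)$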

Higher complexity models (high VC dim) require more data to generalize.
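The VC dimension is defined through shattering: a set of points is shattered by $\mathcal{H}$ if every possible labeling of those points is realized by some hypothesis. The brute-force check below is a sketch for two simple one-dimensional classes (thresholds and closed intervals). It assumes distinct input points, and the function names are hypothetical.

```python
from itertools import combinations


def _cut_points(points):
    """Representative cut positions: one below, between, and above the points."""
    pts = sorted(points)
    middles = [(a + b) / 2 for a, b in zip(pts, pts[1:])]
    return [pts[0] - 1] + middles + [pts[-1] + 1]


def threshold_labelings(points):
    """All labelings realizable by h_t(x) = 1 if x >= t else 0."""
    return {tuple(1 if x >= t else 0 for x in points) for t in _cut_points(points)}


def interval_labelings(points):
    """All labelings realizable by h(x) = 1 if lo <= x <= hi else 0."""
    labelings = {tuple(0 for _ in points)}  # an interval containing no point
    for lo, hi in combinations(_cut_points(points), 2):
        labelings.add(tuple(1 if lo <= x <= hi else 0 for x in points))
    return labelings


def is_shattered(points, labelings_fn):
    """True if every one of the 2^n labelings of the points is realizable."""
    return len(labelings_fn(points)) == 2 ** len(points)


print(is_shattered([0.0], threshold_labelings))            # one point
print(is_shattered([0.0, 1.0], threshold_labelings))       # two points
print(is_shattered([0.0, 1.0], interval_labelings))        # two points
print(is_shattered([0.0, 1.0, 2.0], interval_labelings))   # three points
```

Thresholds shatter one point but not two (the labeling "left positive, right negative" is impossible), and intervals shatter two points but not three (the labeling 1, 0, 1 is impossible). This matches VC dimensions of 1 and 2 respectively. The check on a single point set only suggests the result; on the real line it carries over because any set of distinct points behaves the same way once sorted.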

9. No Free Lunch Theorem (NFL)

The No Free Lunch theorem (Wolpert, 1996) states that, averaged over all possible data-generating distributions, every classification algorithm has the same error rate when classifying previously unobserved points.

Implications

No Universal Best Algorithm: There is no single algorithm (e.g., Random Forest, Deep Learning) that works best on every problem.
Assumptions Matter: An algorithm performs well on a specific task only because its inductive bias matches the properties of that specific task.
Random Guessing: Averaged over all possible problems (including completely random noise), a sophisticated algorithm performs no better than random guessing. We succeed in the real world because real-world problems are not random; they have structure.
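The averaging argument can be made concrete on a tiny domain. The sketch below enumerates every possible binary target function on five inputs, trains three very different learners on the same three inputs, and averages their accuracy on the two unseen inputs. The domain, the train/test split, and the learners are assumptions chosen so that the full enumeration stays small.

```python
from itertools import product

X = [0, 1, 2, 3, 4]
train_x = [0, 1, 2]   # inputs the learner gets to see
test_x = [3, 4]       # previously unobserved inputs


def majority_learner(train):
    """Predict the most common training label everywhere."""
    ones = sum(y for _, y in train)
    guess = 1 if 2 * ones > len(train) else 0
    return lambda x: guess


def nearest_neighbor_learner(train):
    """Predict the label of the closest training input."""
    def h(x):
        nearest = min(train, key=lambda pair: abs(pair[0] - x))
        return nearest[1]
    return h


def always_one_learner(train):
    """Ignore the data entirely."""
    return lambda x: 1


def average_off_training_accuracy(learner):
    """Mean accuracy on test_x, averaged uniformly over all 2^5 targets."""
    targets = list(product([0, 1], repeat=len(X)))
    total = 0.0
    for labels in targets:
        f = dict(zip(X, labels))
        h = learner([(x, f[x]) for x in train_x])
        total += sum(h(x) == f[x] for x in test_x) / len(test_x)
    return total / len(targets)


for learner in (majority_learner, nearest_neighbor_learner, always_one_learner):
    print(learner.__name__, average_off_training_accuracy(learner))
```

Each learner averages exactly 0.5 on the unseen points, because for every training labeling the unseen labels take all four combinations equally often. A learner only beats chance when the set of plausible targets is restricted in a way that matches its bias.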

10. Choosing Algorithms Based on Data and Assumptions

Because of the NFL theorem, algorithm selection must be based on the characteristics of the data and domain knowledge.

Factor	Guideline for Algorithm Selection
Bias-Variance Trade-off	Use simpler models (Linear Regression, Naive Bayes) for small datasets to limit variance. Use complex models (Deep Learning, Boosted Trees) for very large datasets to reduce bias.
Interpretability	If logic must be explained (e.g., medical, legal), use Decision Trees or Linear Models over Black-box Neural Networks.
Dimensionality	High dimensional sparse data (text) often works well with Linear SVMs. Low dimensional dense data works well with KNN or Neural Nets.
Prior Knowledge	If valid assumptions exist (e.g., image data has spatial locality), use algorithms with matching bias (e.g., Convolutional Neural Networks).

The Occam's Razor Principle

When choosing between two hypotheses that explain the data equally well, choose the simpler one. Simpler hypothesis classes (lower VC dimension) are less likely to overfit and need fewer training examples, i.e., they have lower sample complexity.
