Unit 1 - Notes
INT394
Unit 1: Foundations of Machine Learning
1. Introduction to Machine Learning
1.1 Definition
Machine Learning (ML) is a subset of artificial intelligence (AI) that focuses on building systems that learn from data, identify patterns, and make decisions with minimal human intervention.
Standard Definitions:
- Arthur Samuel (1959): "Field of study that gives computers the ability to learn without being explicitly programmed."
Tom Mitchell (1997) - The Formal Definition:
"A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E."- Example (Spam Filter):
- Task (T): Classifying emails as spam or not spam.
- Experience (E): Watching you label emails as spam or ham.
- Performance (P): The percentage of emails correctly classified.
- Example (Spam Filter):
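The performance measure P above is just classification accuracy, which can be computed in a couple of lines. A minimal sketch (the labels below are invented for illustration):

```python
# Measuring P for the spam-filter example: the fraction of emails
# classified correctly. Labels are made up for illustration.
true_labels = ["spam", "ham", "ham", "spam", "ham"]
predictions = ["spam", "ham", "spam", "spam", "ham"]

p = sum(t == pr for t, pr in zip(true_labels, predictions)) / len(true_labels)
print(p)  # → 0.8 (4 of 5 emails classified correctly)
```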
1.2 Need for Machine Learning
Traditional programming relies on hard-coded rules (e.g., if-else logic). This approach fails when:
- Tasks are too complex: Example: Face recognition (cannot write rules for every pixel variation).
- Environments change: Example: Stock market prediction or malware detection (patterns evolve).
- Human expertise is absent: Example: Analysis of DNA sequences where patterns are unknown to humans.
- Personalization is required: Example: Netflix/Amazon recommendation systems unique to every user.
1.3 Scope of Machine Learning
- Computer Vision: Image classification, object detection, facial recognition.
- Natural Language Processing (NLP): Translation, sentiment analysis, chatbots (LLMs).
- Healthcare: Disease diagnosis, drug discovery, personalized medicine.
- Finance: Fraud detection, credit scoring, algorithmic trading.
- Robotics: Autonomous navigation, manipulation tasks.
2. Paradigms of Machine Learning
2.1 Supervised Learning
The algorithm learns from a labeled dataset. The model is trained on input-output pairs to learn a mapping function $f: \mathcal{X} \to \mathcal{Y}$.
- Goal: Predict the output for new, unseen inputs.
- Types:
- Regression: The output variable is continuous (numerical).
- Example: Predicting house prices based on square footage.
- Classification: The output variable is categorical (discrete).
- Example: Classifying a tumor as malignant or benign (Binary), or classifying handwritten digits 0-9 (Multiclass).
- Common Algorithms: Linear Regression, Logistic Regression, Support Vector Machines (SVM), Decision Trees, Neural Networks.
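As a concrete sketch of supervised regression, the snippet below fits the house-price example with ordinary least squares using only NumPy. The numbers are invented (and exactly linear) purely for illustration:

```python
import numpy as np

# Toy supervised regression: predict house price from square footage.
# Illustrative data only; here price = 0.2 * sqft exactly.
sqft = np.array([800.0, 1200.0, 1500.0, 2000.0, 2500.0])
price = np.array([160.0, 240.0, 300.0, 400.0, 500.0])  # in $1000s

# Fit y = w * x + b by ordinary least squares.
X = np.column_stack([sqft, np.ones_like(sqft)])
w, b = np.linalg.lstsq(X, price, rcond=None)[0]

# Predict the price of an unseen 1800 sq ft house.
prediction = w * 1800 + b
print(round(prediction, 1))  # → 360.0 (the data is exactly linear)
```

The learned mapping (here $f(x) = wx + b$) is then applied to inputs the model never saw during training, which is the whole point of supervised learning.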
2.2 Unsupervised Learning
The algorithm learns from an unlabeled dataset. The system tries to learn the underlying structure or distribution of the data without specific guidance.
- Goal: Discover hidden patterns or groupings in data.
- Types:
- Clustering: Grouping similar data points together.
- Example: Customer segmentation based on purchasing behavior.
- Dimensionality Reduction: Reducing the number of input variables while retaining essential information.
- Example: PCA (Principal Component Analysis) for data visualization.
- Association: Discovering rules that describe large portions of data.
- Example: Market Basket Analysis ("People who buy bread also buy butter").
- Common Algorithms: K-Means Clustering, Hierarchical Clustering, PCA, Apriori.
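To make clustering concrete, here is a minimal K-Means sketch on a made-up one-dimensional "annual spend" feature. The data and the one-feature simplification are assumptions for illustration; real customer segmentation would use many features:

```python
import numpy as np

# Minimal K-Means sketch (k = 2) for customer segmentation on a toy
# 1-D feature (annual spend, in $1000s). Numbers are invented.
spend = np.array([1.0, 1.2, 0.8, 9.0, 9.5, 10.0])

def kmeans_1d(x, k=2, iters=10, seed=0):
    rng = np.random.default_rng(seed)
    centroids = rng.choice(x, size=k, replace=False)  # random initial centers
    for _ in range(iters):
        # Assignment step: each point joins its nearest centroid.
        labels = np.argmin(np.abs(x[:, None] - centroids[None, :]), axis=1)
        # Update step: each centroid becomes the mean of its points.
        centroids = np.array([x[labels == j].mean() for j in range(k)])
    return labels, centroids

labels, centroids = kmeans_1d(spend)
print(np.sort(centroids))  # one low-spend and one high-spend cluster center
```

Note that no labels were provided: the two groups emerge purely from the structure of the data, which is the defining property of unsupervised learning.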
2.3 Reinforcement Learning (RL)
An agent interacts with an environment to achieve a goal. The agent learns by trial and error, receiving feedback in the form of rewards or penalties.
- Components: Agent, Environment, State, Action, Reward.
- Goal: Maximize the cumulative reward over time (Policy Optimization).
- Example: A chess engine playing against itself; a robot learning to walk.
- Distinction: Unlike supervised learning, feedback is delayed (you might not know a move was bad until you lose the game 50 moves later).
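A minimal reinforcement-learning sketch is the multi-armed bandit: one state, two actions, and reward feedback only. The win probabilities below are invented, and epsilon-greedy is just one simple exploration strategy:

```python
import random

# Epsilon-greedy agent on a two-armed bandit (a one-state RL problem).
# Reward probabilities are made up for illustration.
random.seed(42)
true_win_prob = [0.3, 0.8]   # arm 1 is better, but the agent does not know this
q = [0.0, 0.0]               # estimated value of each action
counts = [0, 0]
epsilon = 0.1                # exploration rate

for step in range(2000):
    # Explore with probability epsilon, otherwise exploit the best estimate.
    if random.random() < epsilon:
        action = random.randrange(2)
    else:
        action = max(range(2), key=lambda a: q[a])
    reward = 1.0 if random.random() < true_win_prob[action] else 0.0
    counts[action] += 1
    # Incremental average: move the estimate toward the observed reward.
    q[action] += (reward - q[action]) / counts[action]

print(q)  # estimates should approach [0.3, 0.8]; arm 1 gets pulled most often
```

The agent was never told which arm is correct; it discovered the better action purely through trial, error, and reward, mirroring the Agent/Action/Reward loop above.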
3. Challenges in Machine Learning
- Insufficient Quantity of Training Data: Complex problems require thousands or millions of examples.
- Non-Representative Training Data: If the training data does not reflect the real world, sampling bias occurs.
- Poor Quality Data: Errors, outliers, and noise in data make it hard for the model to detect the underlying pattern.
- Irrelevant Features: "Garbage in, garbage out." Success depends on Feature Engineering (selecting the right inputs).
- Overfitting: The model learns the training data too well, including the noise, and fails to generalize to new data. (Low bias, High variance).
- Underfitting: The model is too simple to capture the underlying structure of the data. (High bias, Low variance).
- Curse of Dimensionality: As the number of features (dimensions) grows, the amount of data needed to generalize accurately grows exponentially.
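Overfitting versus underfitting can be seen directly by fitting polynomials of different degrees to noisy linear data. This is a sketch with invented data; the degree-9 model has enough capacity to memorize the 10 noisy training points:

```python
import numpy as np

# Fit degree-1 and degree-9 polynomials to noisy linear data, then compare
# training error with error on held-out points. Data is synthetic.
rng = np.random.default_rng(0)
x_train = np.linspace(0, 1, 10)
y_train = 2 * x_train + rng.normal(0, 0.2, size=10)  # truth is linear + noise
x_test = np.linspace(0.05, 0.95, 50)
y_test = 2 * x_test                                  # noiseless ground truth

errors = {}
for degree in (1, 9):
    coeffs = np.polyfit(x_train, y_train, degree)
    train_err = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    test_err = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
    errors[degree] = (train_err, test_err)
    print(degree, round(train_err, 4), round(test_err, 4))

# The degree-9 fit drives training error toward zero (memorizing the noise)
# but typically does worse than the straight line on the held-out points:
# low bias, high variance — the signature of overfitting.
```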
4. Statistical Learning Framework
To study ML theoretically, we formalize the learning problem using statistical concepts.
4.1 Key Components
- Domain Set ($\mathcal{X}$): The set of all possible inputs (e.g., all possible 28x28 pixel images).
- Label Set ($\mathcal{Y}$): The set of all possible outcomes (e.g., $\{0, 1\}$ for binary classification).
- Training Data ($S$): A finite sequence of pairs $S = ((x_1, y_1), \ldots, (x_m, y_m))$.
- Data Distribution ($\mathcal{D}$): An unknown probability distribution over $\mathcal{X}$. We assume data is generated i.i.d. (independently and identically distributed) from $\mathcal{D}$.
- Labeling Function ($f$): The true function $f: \mathcal{X} \to \mathcal{Y}$ such that $y_i = f(x_i)$. (In probabilistic settings, this is the conditional distribution $P(y \mid x)$.)
4.2 The Goal
The goal is to find a hypothesis $h$ (a function $h: \mathcal{X} \to \mathcal{Y}$) that minimizes the error on new data.
4.3 Loss Function
A measure of the difference between the predicted value and the actual value.
- 0-1 Loss (Classification): $\ell(h, (x, y)) = 1$ if $h(x) \neq y$, else $0$.
- Squared Loss (Regression): $\ell(h, (x, y)) = (h(x) - y)^2$.
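Both losses reduce to a one-line average over a sample. A quick sketch with invented predictions:

```python
# Computing the two standard losses on toy predictions.
# 0-1 loss: average number of misclassifications.
y_true = [1, 0, 1, 1]
y_pred = [1, 1, 1, 0]  # hypothetical classifier output
zero_one = sum(1 for yt, yp in zip(y_true, y_pred) if yt != yp) / len(y_true)
print(zero_one)  # → 0.5 (2 of 4 predictions are wrong)

# Squared loss: average of (h(x) - y)^2 over the sample.
r_true = [3.0, -0.5, 2.0]
r_pred = [2.5, 0.0, 2.0]  # hypothetical regressor output
squared = sum((yt - yp) ** 2 for yt, yp in zip(r_true, r_pred)) / len(r_true)
print(round(squared, 4))  # → 0.1667 (mean of 0.25, 0.25, 0.0)
```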
4.4 True Risk (Generalization Error)
The probability that the hypothesis will make an error on a random sample drawn from distribution $\mathcal{D}$:
$L_{\mathcal{D}, f}(h) = \Pr_{x \sim \mathcal{D}}[h(x) \neq f(x)]$
This is what we want to minimize.
Problem: We cannot calculate this because we do not know $\mathcal{D}$.
5. Empirical Risk Minimization (ERM) and Inductive Bias
Since we cannot calculate the True Risk (we don't see all future data), we use the training data as a proxy.
5.1 Empirical Risk Minimization (ERM)
Empirical risk is the average error calculated over the training set $S$:
$L_S(h) = \frac{|\{i \in [m] : h(x_i) \neq y_i\}|}{m}$
The ERM Principle: Since we cannot minimize the True Risk directly, we search for the hypothesis that minimizes the Empirical Risk: $h_S \in \operatorname{argmin}_h L_S(h)$.
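ERM over a small finite hypothesis class can be written out directly. Here the class is assumed, for illustration, to be a handful of threshold classifiers ("predict 1 if $x \geq t$"), and the data is invented:

```python
# ERM sketch: the hypothesis class is a finite set of threshold
# classifiers h_t(x) = 1 if x >= t, else 0. Data is invented.
data = [(1.0, 0), (2.0, 0), (3.0, 1), (4.0, 1), (2.5, 0)]  # (x, y) pairs

def empirical_risk(t, sample):
    # Average 0-1 loss of h_t over the sample.
    return sum(1 for x, y in sample if (1 if x >= t else 0) != y) / len(sample)

thresholds = [0.5, 1.5, 2.75, 3.5]  # the (finite) hypothesis class
best_t = min(thresholds, key=lambda t: empirical_risk(t, data))
print(best_t, empirical_risk(best_t, data))  # → 2.75 0.0
```

The `min` over `thresholds` is exactly the $\operatorname{argmin}$ of the ERM rule: we pick the hypothesis with the smallest training error, using the training set as a proxy for the unknown true risk.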
5.2 The Problem with Pure ERM (Overfitting)
If we consider every possible function map, we can find a hypothesis that memorizes the training data perfectly (Empirical Risk = 0) but fails completely on new data (True Risk is high). This is Overfitting.
5.3 Inductive Bias
To prevent overfitting, we must restrict the set of functions the learner is allowed to choose from. This restriction is called Inductive Bias.
- Hypothesis Class ($\mathcal{H}$): We restrict our search to a specific class of functions (e.g., "only linear functions" or "only decision trees of depth 5").
- Definition: Inductive bias refers to the set of assumptions the learner makes to predict outputs for inputs it has not encountered.
- The ERM learner with Inductive Bias: $\mathrm{ERM}_{\mathcal{H}}(S) \in \operatorname{argmin}_{h \in \mathcal{H}} L_S(h)$.
Types of Bias:
- Restriction Bias: Limiting $\mathcal{H}$ (e.g., assuming the data is linearly separable).
- Preference Bias: Preferring simpler hypotheses over complex ones (Occam's Razor) within $\mathcal{H}$.
6. Probably Approximately Correct (PAC) Learning
PAC Learning is a theoretical framework introduced by Leslie Valiant (1984) to define what it means for a learning algorithm to "learn" effectively.
6.1 The Intuition
We cannot expect a learner to produce a hypothesis with exactly zero error, nor with 100% certainty, because:
- The training sample is a random subset (we might get unlucky with bad data).
- We generally don't see the whole universe of data.
Therefore, we demand that the learner produces a hypothesis that is Probably (with probability at least $1 - \delta$) Approximately Correct (with error at most $\epsilon$).
6.2 Formal Definition
A hypothesis class $\mathcal{H}$ is PAC-learnable if there exists an algorithm $A$ and a polynomial function $m_{\mathcal{H}}(\cdot, \cdot)$ such that:
For every $\epsilon > 0$ (accuracy parameter) and $\delta \in (0, 1)$ (confidence parameter), for every distribution $\mathcal{D}$ over $\mathcal{X}$, and for every target concept $f \in \mathcal{H}$:
If the algorithm is given $m$ i.i.d. samples where $m \geq m_{\mathcal{H}}(\epsilon, \delta)$, the algorithm produces a hypothesis $h$ such that:
$\Pr[L_{\mathcal{D}, f}(h) \leq \epsilon] \geq 1 - \delta$
6.3 Key Parameters
- Error ($\epsilon$): How far off the hypothesis is allowed to be from the truth. If $\epsilon = 0.05$, we accept 95% accuracy.
- Confidence ($\delta$): The allowed probability of failure. If $\delta = 0.01$, we are 99% sure the hypothesis falls within the error margin.
- Sample Complexity ($m_{\mathcal{H}}(\epsilon, \delta)$): The number of training examples required to guarantee the PAC bounds.
6.4 Implication
PAC learning provides the mathematical justification for Machine Learning. It tells us that if we restrict the Hypothesis Class (Inductive Bias) and have enough data (Sample Complexity), we can guarantee generalization.
Sample Complexity for Finite Hypothesis Classes:
For a finite class $\mathcal{H}$ (realizable case), the number of samples required to satisfy PAC learning is:
$m \geq \frac{1}{\epsilon} \ln\left(\frac{|\mathcal{H}|}{\delta}\right)$
- This formula shows that more complex models (larger $|\mathcal{H}|$) require more data, though only logarithmically in $|\mathcal{H}|$.
- Higher accuracy (lower $\epsilon$) requires linearly more data (in $1/\epsilon$).
- Higher confidence (lower $\delta$) requires only logarithmically more data (in $1/\delta$).
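The bound $m \geq \frac{1}{\epsilon} \ln(|\mathcal{H}|/\delta)$ is cheap to evaluate, and plugging in a few (illustrative) settings makes the three scaling claims above concrete:

```python
import math

# Evaluate the finite-class sample-complexity bound
# m >= (1/eps) * ln(|H| / delta) for a few illustrative settings.
def sample_bound(h_size, eps, delta):
    return math.ceil((1 / eps) * math.log(h_size / delta))

print(sample_bound(1000, 0.05, 0.01))    # → 231  (baseline)
print(sample_bound(10**6, 0.05, 0.01))   # → 369  (1000x larger class: modest increase)
print(sample_bound(1000, 0.01, 0.01))    # → 1152 (5x smaller eps: ~5x more samples)
print(sample_bound(1000, 0.05, 0.001))   # → 277  (10x smaller delta: small increase)
```

Tightening $\epsilon$ dominates the cost, while growing $|\mathcal{H}|$ or shrinking $\delta$ only adds logarithmic terms, exactly as the bullets above state.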