Unit 3: Probability Theory and Loss Functions for Machine Learning
1. Random Variables in Machine Learning Models
Definition and Role
A random variable is a variable whose value is a numerical outcome of a random phenomenon. In machine learning, we treat data points, features, model parameters, and predictions as random variables. This probabilistic view allows us to model the inherent uncertainty and noise in data and the learning process.
- Inputs/Features (X): The features of a dataset are often treated as random variables. For example, the height of a person in a dataset is a random variable drawn from a distribution of human heights.
- Outputs/Targets (Y): The target variable we want to predict is also a random variable. Its randomness comes from measurement noise or inherent stochasticity in the system.
- Model Parameters (θ): In Bayesian models, the parameters themselves (e.g., weights in a linear regression) are treated as random variables, having their own distributions.
Types of Random Variables
- Discrete Random Variable:
  - Can take on a finite or countably infinite number of distinct values.
  - Characterized by a Probability Mass Function (PMF), denoted P(X=x), which gives the probability of the variable taking a specific value x.
  - The sum of probabilities over all possible values is 1: Σ P(x) = 1.
  - ML Examples:
    - The outcome of a coin flip (Heads=1, Tails=0).
    - The class label in a classification problem (e.g., {0: Cat, 1: Dog, 2: Bird}).
    - The number of times a specific word appears in a document.
- Continuous Random Variable:
  - Can take on any value within a given range or interval.
  - Characterized by a Probability Density Function (PDF), denoted p(x) or f(x).
  - The probability of the variable falling within an interval [a, b] is the integral of the PDF over that interval: P(a ≤ X ≤ b) = ∫[a,b] p(x) dx.
  - The total area under the PDF curve is 1: ∫[-∞,∞] p(x) dx = 1.
  - Important Note: For a continuous variable, the probability of it taking any single specific value is zero, i.e., P(X=x) = 0. We only talk about probabilities over intervals.
  - ML Examples:
    - The price of a house.
    - The temperature measurement from a sensor.
    - The weight parameters in a neural network.
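To make the PMF/PDF distinction concrete, here is a minimal standard-library Python sketch (the Bernoulli parameter 0.3 and the integration range are arbitrary choices): the discrete PMF sums to exactly 1 over its support, while the continuous PDF integrates to 1.

```python
import math

# Discrete: the PMF of a Bernoulli(p=0.3) variable sums to 1 over {0, 1}.
def bernoulli_pmf(x, p):
    return p**x * (1 - p)**(1 - x)

pmf_total = sum(bernoulli_pmf(x, 0.3) for x in (0, 1))

# Continuous: the standard Gaussian PDF integrates to 1;
# approximate the integral with a Riemann sum over [-8, 8].
def gaussian_pdf(x, mu=0.0, sigma2=1.0):
    return math.exp(-(x - mu)**2 / (2 * sigma2)) / math.sqrt(2 * math.pi * sigma2)

dx = 0.001
pdf_total = sum(gaussian_pdf(-8 + i * dx) * dx for i in range(int(16 / dx)))

assert abs(pmf_total - 1.0) < 1e-12
assert abs(pdf_total - 1.0) < 1e-4
```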
2. Probability Distributions in Learning Algorithms
A probability distribution describes the likelihood of a random variable taking on each of its possible values. The choice of distribution is a critical modeling assumption in ML.
Discrete Distributions
- Bernoulli Distribution:
  - Description: Models a single trial with two possible outcomes (e.g., success/failure, yes/no, 1/0).
  - Parameter: p (probability of success, where the outcome is 1).
  - PMF: P(X=x | p) = p^x * (1-p)^(1-x) for x ∈ {0, 1}.
  - ML Context: The foundation for binary classification. The output of a logistic regression model is interpreted as the parameter p of a Bernoulli distribution for the class label.
- Categorical (or Multinoulli) Distribution:
  - Description: A generalization of the Bernoulli distribution to a single trial with K possible outcomes.
  - Parameters: A vector p = (p₁, p₂, ..., pₖ) where pₖ is the probability of the k-th outcome and Σ pₖ = 1.
  - PMF: P(X=k | p) = pₖ. Often written using one-hot encoding, where x is a vector with a 1 at the k-th position and 0s elsewhere: P(X=x | p) = Π pₖ^(xₖ).
  - ML Context: The foundation for multi-class classification. The output of a softmax function in a neural network is interpreted as the parameter vector p for the class label.
- Binomial Distribution:
  - Description: Models the number of successes k in a fixed number n of independent Bernoulli trials.
  - Parameters: n (number of trials), p (probability of success in each trial).
  - PMF: P(X=k | n, p) = C(n, k) * p^k * (1-p)^(n-k), where C(n, k) is the binomial coefficient "n choose k".
  - ML Context: Less common as a direct model output, but useful in statistical modeling and data analysis (e.g., modeling conversion rates over n website visits).
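As an illustration of the Categorical distribution in the multi-class setting, the sketch below (the logits are made up) converts a hypothetical network's final-layer scores into a probability vector via softmax and evaluates the one-hot PMF:

```python
import math

# Hypothetical logits from a network's final layer for K = 3 classes.
logits = [2.0, 1.0, 0.1]

def softmax(zs):
    # Subtract the max before exponentiating, for numerical stability.
    m = max(zs)
    exps = [math.exp(z - m) for z in zs]
    total = sum(exps)
    return [e / total for e in exps]

p = softmax(logits)  # parameter vector of a Categorical distribution

# Categorical PMF with a one-hot label x: P(X=x | p) = Π pₖ^(xₖ)
x = [0, 1, 0]  # true class is the second one
pmf = math.prod(pk**xk for pk, xk in zip(p, x))

assert abs(sum(p) - 1.0) < 1e-12  # valid probability vector
assert pmf == p[1]                # the one-hot product just picks out p₂
```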
Continuous Distributions
- Uniform Distribution:
  - Description: All values within a given interval [a, b] are equally likely.
  - Parameters: a (lower bound), b (upper bound).
  - PDF: p(x | a, b) = 1 / (b - a) for x in [a, b], and 0 otherwise.
  - ML Context: Often used for initializing weights in neural networks when there is no prior knowledge about their values. Also used in some data generation and sampling techniques.
- Gaussian (Normal) Distribution:
  - Description: The classic "bell curve," defined by its mean (center) and variance (spread). The Central Limit Theorem states that the sum of many independent random variables tends toward a Gaussian distribution.
  - Parameters: μ (mean), σ² (variance).
  - PDF: p(x | μ, σ²) = (1 / sqrt(2πσ²)) * exp(-(x - μ)² / (2σ²))
  - ML Context:
    - Modeling Noise: The error term in linear regression is often assumed to be Gaussian.
    - Feature Distribution: Many natural phenomena follow a Gaussian distribution, making it a common assumption for data features.
    - Bayesian ML: Used as a prior distribution for model parameters, which leads to L2 regularization (see section 5).
    - Gaussian Mixture Models (GMMs): Used for clustering.
- Laplace Distribution:
  - Description: Similar to the Gaussian but "pointier" at the mean, with "fatter" tails. This means it assigns higher probability to values near the mean and also to extreme outliers, compared to the Gaussian.
  - Parameters: μ (location, mean), b (scale, diversity).
  - PDF: p(x | μ, b) = (1 / (2b)) * exp(-|x - μ| / b)
  - ML Context:
    - Robustness: Its fatter tails make it more robust to outliers than the Gaussian.
    - Bayesian ML: Used as a prior distribution for model parameters, which leads to L1 regularization and sparsity (see section 5).
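A quick numerical check of the "pointier, fatter tails" claim, comparing both PDFs with illustrative unit parameters (note that Laplace(0, 1) and N(0, 1) have different variances, so this comparison is qualitative only):

```python
import math

def gaussian_pdf(x, mu=0.0, sigma2=1.0):
    return math.exp(-(x - mu)**2 / (2 * sigma2)) / math.sqrt(2 * math.pi * sigma2)

def laplace_pdf(x, mu=0.0, b=1.0):
    return math.exp(-abs(x - mu) / b) / (2 * b)

# At the mean, Laplace(0, 1) is taller ("pointier") than N(0, 1): 0.5 vs ~0.399 ...
assert laplace_pdf(0.0) > gaussian_pdf(0.0)
# ... and far out in the tails it is also larger ("fatter tails").
assert laplace_pdf(5.0) > gaussian_pdf(5.0)
```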
3. Likelihood and Maximum Likelihood Estimation (MLE)
MLE is a fundamental principle for "learning" the parameters of a model from data. It asks: "What model parameters are most likely to have generated the observed data?"
Probability vs. Likelihood
This is a subtle but crucial distinction.
- Probability, P(data | θ): The parameters θ are fixed. We ask about the probability of seeing different data outcomes; the function varies over the data.
- Likelihood, L(θ | data): The data is fixed (it is what we observed). We ask which parameters θ make this observed data most probable; the function varies over θ.
Numerically, L(θ | data) = P(data | θ), but they are conceptually different functions.
The Principle of Maximum Likelihood
Given a dataset D = {x₁, x₂, ..., xₙ} which we assume is i.i.d. (independent and identically distributed) and drawn from a distribution P(x | θ), the likelihood of the entire dataset is the product of the probabilities of each individual data point:
L(θ | D) = P(D | θ) = Π P(xᵢ | θ)
The goal of Maximum Likelihood Estimation (MLE) is to find the parameter values θ_MLE that maximize this likelihood function.
θ_MLE = argmax_θ L(θ | D)
The Log-Likelihood Trick
Products of many probabilities are numerically unstable and difficult to optimize directly. Since the logarithm is a monotonically increasing function, maximizing L(θ) is equivalent to maximizing log(L(θ)). This is a powerful trick.
The log-likelihood, ℓ(θ | D), turns the product into a sum:
ℓ(θ | D) = log(L(θ | D)) = log(Π P(xᵢ | θ)) = Σ log(P(xᵢ | θ))
Sums are much easier to differentiate and work with. So, we find θ_MLE by solving:
θ_MLE = argmax_θ Σ log(P(xᵢ | θ))
This is typically done by taking the derivative of the log-likelihood with respect to θ, setting it to zero, and solving for θ.
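The numerical motivation for the log trick can be seen directly: a raw product of many probabilities underflows to zero in floating point, while the sum of logs stays perfectly usable. A minimal sketch with 1,000 identical points of probability 0.1:

```python
import math

# 1,000 i.i.d. points, each with probability 0.1.
probs = [0.1] * 1000

# The raw likelihood 0.1^1000 = 1e-1000 is far below the smallest
# representable double (~5e-324), so the product underflows to 0.0.
product = 1.0
for p in probs:
    product *= p

# The log-likelihood is just a well-behaved sum: 1000 * log(0.1) ≈ -2302.6.
log_likelihood = sum(math.log(p) for p in probs)

assert product == 0.0
assert abs(log_likelihood - 1000 * math.log(0.1)) < 1e-8
```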
Example: MLE for a Bernoulli Distribution
- Problem: We flip a coin n times and observe n_H heads and n_T tails. What is the MLE for p, the probability of heads?
- Data: D consists of n_H outcomes of x=1 and n_T outcomes of x=0.
- Likelihood: The PMF is P(x | p) = p^x * (1-p)^(1-x). The total likelihood is:
  L(p | D) = Π [p^(xᵢ) * (1-p)^(1-xᵢ)] = p^(n_H) * (1-p)^(n_T)
- Log-Likelihood:
  ℓ(p | D) = log(p^(n_H) * (1-p)^(n_T)) = n_H * log(p) + n_T * log(1-p)
- Optimization: Take the derivative with respect to p and set it to zero:
  dℓ/dp = n_H/p - n_T/(1-p) = 0
- Solution: Solving for p gives:
  p_MLE = n_H / (n_H + n_T)
This is the intuitive result: the best estimate for the probability of heads is the observed frequency of heads.
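The closed-form result can be sanity-checked numerically. In this sketch (with made-up counts n_H = 7, n_T = 3), a brute-force grid search over p recovers the same maximizer as the formula:

```python
import math

def log_likelihood(p, n_heads, n_tails):
    # ℓ(p | D) = n_H log(p) + n_T log(1 - p)
    return n_heads * math.log(p) + n_tails * math.log(1 - p)

n_heads, n_tails = 7, 3
p_mle = n_heads / (n_heads + n_tails)  # closed-form MLE: 0.7

# Grid search over p in (0, 1); it should agree with the closed form.
grid = [i / 1000 for i in range(1, 1000)]
p_best = max(grid, key=lambda p: log_likelihood(p, n_heads, n_tails))

assert abs(p_best - p_mle) < 1e-9  # the grid contains 0.700 exactly
```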
4. Loss Functions: Quantifying Model Error
A loss function (or cost function) L(y, ŷ) measures the penalty for a model's prediction ŷ being different from the true value y. The goal of training a machine learning model is to find parameters that minimize the average loss over the training data.
Probabilistic Connection
Many common loss functions are derived directly from the principle of Maximum Likelihood Estimation under a specific assumption about the probability distribution of the target variable. Minimizing the negative log-likelihood is equivalent to minimizing the loss function.
Squared Error Loss (L2 Loss)
- Definition: L(y, ŷ) = (y - ŷ)²
- Use Case: The standard loss function for regression problems.
- Probabilistic Interpretation:
  - Assume the target variable y is determined by our model's prediction ŷ = f(x) plus some Gaussian noise ε: y = f(x) + ε, where ε ~ N(0, σ²) (noise is normally distributed with mean 0).
  - This is equivalent to saying the probability of y given x is Gaussian: P(y | x; θ) = N(y | f(x; θ), σ²).
  - The PDF is: P(y | x; θ) = (1 / sqrt(2πσ²)) * exp(-(y - f(x; θ))² / (2σ²)).
  - The negative log-likelihood for a single data point (x, y) is:
    -log(P(y | x; θ)) = -log(1 / sqrt(2πσ²)) + (y - f(x; θ))² / (2σ²)
  - To find the parameters θ that maximize the likelihood, we minimize the negative log-likelihood. The first term and the factor 1/(2σ²) are constants with respect to the model's output f(x; θ).
  - Therefore, minimizing this expression is equivalent to minimizing (y - f(x; θ))².
- Conclusion: Minimizing the Squared Error Loss is equivalent to performing Maximum Likelihood Estimation for the model parameters under the assumption of Gaussian noise in the target variable.
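The equivalence can be verified numerically: with σ² = 1, the Gaussian negative log-likelihood is half the squared error plus a constant, so both are minimized by the same predictions. A minimal sketch (the (y, ŷ) values are arbitrary):

```python
import math

def gaussian_nll(y, y_hat, sigma2=1.0):
    # Negative log of N(y | ŷ, σ²)
    return 0.5 * math.log(2 * math.pi * sigma2) + (y - y_hat)**2 / (2 * sigma2)

def squared_error(y, y_hat):
    return (y - y_hat)**2

# For σ² = 1: NLL = constant + SE/2, i.e., the same objective up to an
# affine transformation that does not move the minimizer.
const = 0.5 * math.log(2 * math.pi)
for y, y_hat in [(3.0, 2.5), (1.0, 1.0), (-2.0, 0.5)]:
    assert abs(gaussian_nll(y, y_hat) - (const + 0.5 * squared_error(y, y_hat))) < 1e-12
```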
Logistic Loss (Binary Cross-Entropy)
- Definition: L(y, ŷ) = -[y log(ŷ) + (1-y) log(1-ŷ)], where y is the true label (0 or 1) and ŷ is the model's predicted probability of the class being 1.
- Use Case: The standard loss function for binary classification.
- Probabilistic Interpretation:
  - In binary classification, the target y is a binary outcome (0 or 1). We model this with a Bernoulli distribution.
  - Our model (e.g., logistic regression) outputs a probability ŷ. We interpret this as the parameter p of the Bernoulli distribution: P(y | x; θ) = Bernoulli(y | ŷ).
  - The PMF is: P(y | x; θ) = ŷ^y * (1-ŷ)^(1-y).
  - The log-likelihood for a single data point is: log(P(y | x; θ)) = y log(ŷ) + (1-y) log(1-ŷ)
  - The negative log-likelihood is: -log(P(y | x; θ)) = -[y log(ŷ) + (1-y) log(1-ŷ)]
- Conclusion: The Logistic Loss function is exactly the negative log-likelihood of the data under the assumption that the labels are generated from a Bernoulli distribution whose parameter is the model's output.
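The identity between the logistic loss and the Bernoulli negative log-likelihood can be checked directly (the (y, ŷ) pairs below are arbitrary):

```python
import math

def bce(y, y_hat):
    # Binary cross-entropy (logistic loss) for a single example.
    return -(y * math.log(y_hat) + (1 - y) * math.log(1 - y_hat))

def bernoulli_nll(y, y_hat):
    # Negative log of the Bernoulli PMF ŷ^y * (1-ŷ)^(1-y).
    return -math.log(y_hat**y * (1 - y_hat)**(1 - y))

for y, y_hat in [(1, 0.9), (0, 0.9), (1, 0.2)]:
    assert abs(bce(y, y_hat) - bernoulli_nll(y, y_hat)) < 1e-12
```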
Cross-Entropy Loss (Multi-class)
This is the generalization of logistic loss to K classes.
- Definition: L(y, ŷ) = -Σ [yₖ log(ŷₖ)], where y is the one-hot encoded true label vector (e.g., [0, 1, 0] for class 2) and ŷ is the vector of predicted probabilities from a softmax function.
- Probabilistic Interpretation: This is the negative log-likelihood of the data under the assumption that the labels are generated from a Categorical distribution whose parameter vector p is the model's softmax output vector ŷ.
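With a one-hot label, the sum collapses to the negative log-probability of the true class, as this small sketch (with made-up probabilities) shows:

```python
import math

y = [0.0, 1.0, 0.0]      # one-hot label: the true class is index 1
y_hat = [0.2, 0.7, 0.1]  # softmax output (sums to 1)

# Cross-entropy: -Σ yₖ log(ŷₖ). Only the true-class term survives.
cross_entropy = -sum(yk * math.log(pk) for yk, pk in zip(y, y_hat))

assert abs(cross_entropy - (-math.log(0.7))) < 1e-12
```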
5. Bayesian Interpretation of Machine Learning
The Bayesian approach provides a powerful framework for reasoning about model uncertainty and offers a deep justification for regularization.
Frequentist vs. Bayesian View
- Frequentist (MLE): Model parameters θ are fixed, unknown constants. We find a single best point estimate θ_MLE that maximizes the data likelihood.
- Bayesian: Model parameters θ are random variables. We start with a prior belief about θ and update it using data to get a posterior distribution over θ. The result is not a single best value but a full probability distribution over the parameters.
Bayes' Theorem
This theorem is the engine of Bayesian inference. It tells us how to update our beliefs in light of new evidence.
P(θ | D) = [P(D | θ) * P(θ)] / P(D)
- P(θ | D) - Posterior: The probability distribution of the parameters θ after observing the data D. This is our updated belief.
- P(D | θ) - Likelihood: The same likelihood function used in MLE. It represents the probability of the data given a specific choice of parameters.
- P(θ) - Prior: The probability distribution of the parameters θ before observing any data. This encodes our prior beliefs or assumptions (e.g., that weights should be small).
- P(D) - Evidence (or Marginal Likelihood): The probability of the data, averaged over all possible parameter values: P(D) = ∫ P(D | θ) P(θ) dθ. It acts as a normalization constant.
Maximum A Posteriori (MAP) Estimation
Finding the full posterior distribution P(θ | D) can be computationally expensive. A simpler alternative is to find the single most probable parameter setting, similar to MLE. This is called Maximum A Posteriori (MAP) estimation.
θ_MAP = argmax_θ P(θ | D)
Since the evidence P(D) is constant with respect to θ, we can ignore it for optimization:
θ_MAP = argmax_θ [P(D | θ) * P(θ)]
Taking the log, as we did for MLE:
θ_MAP = argmax_θ [log(P(D | θ)) + log(P(θ))]
θ_MAP = argmax_θ [log-likelihood + log-prior]
This reveals a profound connection: MAP estimation is equivalent to maximizing the log-likelihood plus a term from the log of the prior.
The Crucial Link: MAP and Regularization
Regularization is a technique used to prevent overfitting by adding a penalty term to the loss function that discourages complex models (e.g., large parameter weights). MAP provides a probabilistic justification for this.
Objective Function = Loss Function + Regularization Term
-log(Posterior) = -log(Likelihood) + [-log(Prior)]
The negative log-prior acts as the regularization term!
Derivation: Gaussian Prior leads to L2 Regularization (Ridge)
- Assume a Gaussian prior on the model weights w: Before seeing any data, we believe the weights are likely to be small and centered around zero. We model this with a zero-mean Gaussian prior:
  P(w) ~ N(0, τ²I), where τ² controls the width of the prior (how spread out we believe the weights to be).
- Calculate the log-prior:
  log(P(w)) = log[(1 / sqrt(2πτ²)) * exp(-||w||² / (2τ²))]
  log(P(w)) = C - ||w||² / (2τ²)
  where C is a constant and ||w||² is the squared L2-norm (Σ wᵢ²).
- Formulate the MAP objective:
  w_MAP = argmax_w [Σ log(P(yᵢ | xᵢ, w)) + log(P(w))]
  To turn this into a minimization problem (like minimizing loss), we take the negative:
  w_MAP = argmin_w [-Σ log(P(yᵢ | xᵢ, w)) - log(P(w))]
  w_MAP = argmin_w [Loss_MLE - (C - ||w||² / (2τ²))]
- Simplify: Dropping constants, the objective becomes:
  w_MAP = argmin_w [Loss_MLE + (1 / (2τ²)) * ||w||²]
  w_MAP = argmin_w [Loss_MLE + λ * ||w||²]
  where λ = 1 / (2τ²).
- Conclusion: This is exactly the objective function for L2 Regularization (Ridge Regression). The regularization strength λ is inversely proportional to the variance τ² of the prior: a strong prior belief that weights are near zero (small τ²) gives strong regularization (large λ).
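The derivation can be checked on a toy 1-D regression problem: the minimizer of the negative log-posterior (found by grid search) matches the ridge closed form, and is shrunk toward zero relative to the MLE. The data and τ² below are made up:

```python
# Toy 1-D ridge: y ≈ w·x with Gaussian likelihood (σ² = 1) and Gaussian
# prior w ~ N(0, τ²). The negative log-posterior (constants dropped) is
#   J(w) = Σ (yᵢ - w·xᵢ)² / 2 + w² / (2τ²),
# with closed-form minimizer w_MAP = Σ xᵢyᵢ / (Σ xᵢ² + 1/τ²).
xs = [1.0, 2.0, 3.0, 4.0]
ys = [1.1, 1.9, 3.2, 3.9]
tau2 = 0.5  # small τ² → strong prior → strong shrinkage toward 0

def neg_log_posterior(w):
    return sum((y - w * x)**2 for x, y in zip(xs, ys)) / 2 + w**2 / (2 * tau2)

w_closed = sum(x * y for x, y in zip(xs, ys)) / (sum(x * x for x in xs) + 1 / tau2)

# Numerical check: a fine grid search agrees with the closed form.
grid = [i / 10000 for i in range(-20000, 20001)]
w_grid = min(grid, key=neg_log_posterior)
assert abs(w_grid - w_closed) < 1e-3

# The MAP estimate is shrunk toward zero relative to the MLE (no prior).
w_mle = sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)
assert abs(w_closed) < abs(w_mle)
```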
Intuition: Laplace Prior leads to L1 Regularization (Lasso)
A similar derivation can be done for the Laplace prior.
- Assume a Laplace prior on the weights w: P(w) ~ Laplace(0, b). The PDF involves exp(-|w| / b).
- Calculate the log-prior: The log-prior is proportional to -|w|, i.e., to the negative L1-norm.
- Formulate the MAP objective: Minimizing the negative log-posterior becomes:
  w_MAP = argmin_w [Loss_MLE + λ * ||w||₁]
- Conclusion: This is the objective for L1 Regularization (Lasso Regression). The "pointy" shape of the Laplace distribution's PDF at zero encourages many weights to become exactly zero, leading to sparse models.
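The sparsity claim can be illustrated in one dimension: minimizing a quadratic plus an L1 penalty has a closed-form "soft threshold" solution that snaps small values to exactly zero (the penalty λ = 1 here is an arbitrary choice):

```python
# 1-D illustration of why the L1 penalty produces exact zeros.
# Minimizing (w - z)²/2 + λ|w| has the closed-form soft-threshold solution
#   w* = sign(z) · max(|z| - λ, 0),
# so any |z| ≤ λ is mapped to exactly 0 (unlike the L2 penalty, which
# only shrinks w multiplicatively and never reaches zero).
lam = 1.0

def l1_objective(w, z):
    return (w - z)**2 / 2 + lam * abs(w)

def soft_threshold(z):
    return (1 if z > 0 else -1) * max(abs(z) - lam, 0.0)

# Brute-force check on a fine grid for a few observed values z.
grid = [i / 1000 for i in range(-5000, 5001)]
for z in [0.5, -0.8, 2.5]:
    w_grid = min(grid, key=lambda w: l1_objective(w, z))
    assert abs(w_grid - soft_threshold(z)) < 1e-6

assert soft_threshold(0.5) == 0.0   # small inputs become exactly zero
assert soft_threshold(2.5) == 1.5   # large inputs are shrunk by λ
```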