Unit 3: Probability Theory and Loss Functions for Machine Learning
1. Random Variables in Machine Learning Models
Definition and Role
A random variable is a variable whose value is a numerical outcome of a random phenomenon. In machine learning, we treat data points, features, model parameters, and predictions as random variables. This probabilistic view allows us to model the inherent uncertainty and noise in data and the learning process.
- Inputs/Features (X): The features of a dataset are often treated as random variables. For example, the height of a person in a dataset is a random variable drawn from a distribution of human heights.
- Outputs/Targets (Y): The target variable we want to predict is also a random variable. Its randomness comes from measurement noise or inherent stochasticity in the system.
- Model Parameters (θ): In Bayesian models, the parameters themselves (e.g., weights in a linear regression) are treated as random variables, having their own distributions.
Types of Random Variables
- Discrete Random Variable:
  - Can take on a finite or countably infinite number of distinct values.
  - Characterized by a Probability Mass Function (PMF), denoted P(X=x), which gives the probability of the variable taking a specific value x.
  - The sum of probabilities over all possible values is 1: Σ P(x) = 1.
  - ML Examples:
    - The outcome of a coin flip (Heads=1, Tails=0).
    - The class label in a classification problem (e.g., {0: Cat, 1: Dog, 2: Bird}).
    - The number of times a specific word appears in a document.
- Continuous Random Variable:
  - Can take on any value within a given range or interval.
  - Characterized by a Probability Density Function (PDF), denoted p(x) or f(x).
  - The probability of the variable falling within an interval [a, b] is the integral of the PDF over that interval: P(a ≤ X ≤ b) = ∫[a,b] p(x) dx.
  - The total area under the PDF curve is 1: ∫[-∞,∞] p(x) dx = 1.
  - Important Note: For a continuous variable, the probability of it taking any single specific value is zero, i.e., P(X=x) = 0. We only talk about probabilities over intervals.
  - ML Examples:
    - The price of a house.
    - The temperature measurement from a sensor.
    - The weight parameters in a neural network.
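To make the PMF/PDF distinction concrete, here is a minimal standard-library Python sketch (the Bernoulli parameter 0.3 and the integration range are arbitrary choices): the discrete PMF sums to exactly 1 over its support, while the continuous PDF integrates to 1.

```python
import math

# Discrete: the PMF of a Bernoulli(p=0.3) variable sums to 1 over {0, 1}.
def bernoulli_pmf(x, p):
    return p**x * (1 - p)**(1 - x)

pmf_total = sum(bernoulli_pmf(x, 0.3) for x in (0, 1))

# Continuous: the standard Gaussian PDF integrates to 1;
# approximate the integral with a Riemann sum over [-8, 8].
def gaussian_pdf(x, mu=0.0, sigma2=1.0):
    return math.exp(-(x - mu)**2 / (2 * sigma2)) / math.sqrt(2 * math.pi * sigma2)

dx = 0.001
pdf_total = sum(gaussian_pdf(-8 + i * dx) * dx for i in range(int(16 / dx)))

assert abs(pmf_total - 1.0) < 1e-12
assert abs(pdf_total - 1.0) < 1e-4
```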
2. Probability Distributions in Learning Algorithms
A probability distribution describes the likelihood of a random variable taking on each of its possible values. The choice of distribution is a critical modeling assumption in ML.
Discrete Distributions
- Bernoulli Distribution:
  - Description: Models a single trial with two possible outcomes (e.g., success/failure, yes/no, 1/0).
  - Parameter: p (probability of success, where the outcome is 1).
  - PMF: P(X=x | p) = p^x * (1-p)^(1-x) for x ∈ {0, 1}.
  - ML Context: The foundation for binary classification. The output of a logistic regression model is interpreted as the parameter p of a Bernoulli distribution for the class label.
- Categorical (or Multinoulli) Distribution:
  - Description: A generalization of the Bernoulli distribution to a single trial with K possible outcomes.
  - Parameters: A vector p = (p₁, p₂, ..., pₖ) where pₖ is the probability of the k-th outcome and Σ pₖ = 1.
  - PMF: P(X=k | p) = pₖ. Often written using one-hot encoding, where x is a vector with a 1 at the k-th position and 0s elsewhere: P(X=x | p) = Π pₖ^(xₖ).
  - ML Context: The foundation for multi-class classification. The output of a softmax function in a neural network is interpreted as the parameter vector p for the class label.
- Binomial Distribution:
  - Description: Models the number of successes k in a fixed number n of independent Bernoulli trials.
  - Parameters: n (number of trials), p (probability of success in each trial).
  - PMF: P(X=k | n, p) = C(n, k) * p^k * (1-p)^(n-k), where C(n, k) is the binomial coefficient "n choose k".
  - ML Context: Less common as a direct model output, but useful in statistical modeling and data analysis (e.g., modeling conversion rates over n website visits).
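As an illustration of the Categorical distribution in the multi-class setting, the sketch below (the logits are made up) converts a hypothetical network's final-layer scores into a probability vector via softmax and evaluates the one-hot PMF:

```python
import math

# Hypothetical logits from a network's final layer for K = 3 classes.
logits = [2.0, 1.0, 0.1]

def softmax(zs):
    # Subtract the max before exponentiating, for numerical stability.
    m = max(zs)
    exps = [math.exp(z - m) for z in zs]
    total = sum(exps)
    return [e / total for e in exps]

p = softmax(logits)  # parameter vector of a Categorical distribution

# Categorical PMF with a one-hot label x: P(X=x | p) = Π pₖ^(xₖ)
x = [0, 1, 0]  # true class is the second one
pmf = math.prod(pk**xk for pk, xk in zip(p, x))

assert abs(sum(p) - 1.0) < 1e-12  # valid probability vector
assert pmf == p[1]                # the one-hot product just picks out p₂
```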
Continuous Distributions
- Uniform Distribution:
  - Description: All values within a given interval [a, b] are equally likely.
  - Parameters: a (lower bound), b (upper bound).
  - PDF: p(x | a, b) = 1 / (b - a) for x in [a, b], and 0 otherwise.
  - ML Context: Often used for initializing weights in neural networks when there is no prior knowledge about their values. Also used in some data generation and sampling techniques.
- Gaussian (Normal) Distribution:
  - Description: The classic "bell curve," defined by its mean (center) and variance (spread). The Central Limit Theorem states that the sum of many independent random variables tends toward a Gaussian distribution.
  - Parameters: μ (mean), σ² (variance).
  - PDF: p(x | μ, σ²) = (1 / sqrt(2πσ²)) * exp(-(x - μ)² / (2σ²))
  - ML Context:
    - Modeling Noise: The error term in linear regression is often assumed to be Gaussian.
    - Feature Distribution: Many natural phenomena follow a Gaussian distribution, making it a common assumption for data features.
    - Bayesian ML: Used as a prior distribution for model parameters, which leads to L2 regularization (see section 5).
    - Gaussian Mixture Models (GMMs): Used for clustering.
- Laplace Distribution:
  - Description: Similar to the Gaussian but "pointier" at the mean, with "fatter" tails. This means it assigns higher probability to values near the mean and also to extreme outliers, compared to the Gaussian.
  - Parameters: μ (location, mean), b (scale, diversity).
  - PDF: p(x | μ, b) = (1 / (2b)) * exp(-|x - μ| / b)
  - ML Context:
    - Robustness: Its fatter tails make it more robust to outliers than the Gaussian.
    - Bayesian ML: Used as a prior distribution for model parameters, which leads to L1 regularization and sparsity (see section 5).
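A quick numerical check of the "pointier, fatter tails" claim, comparing both PDFs with illustrative unit parameters (note that Laplace(0, 1) and N(0, 1) have different variances, so this comparison is qualitative only):

```python
import math

def gaussian_pdf(x, mu=0.0, sigma2=1.0):
    return math.exp(-(x - mu)**2 / (2 * sigma2)) / math.sqrt(2 * math.pi * sigma2)

def laplace_pdf(x, mu=0.0, b=1.0):
    return math.exp(-abs(x - mu) / b) / (2 * b)

# At the mean, Laplace(0, 1) is taller ("pointier") than N(0, 1): 0.5 vs ~0.399 ...
assert laplace_pdf(0.0) > gaussian_pdf(0.0)
# ... and far out in the tails it is also larger ("fatter tails").
assert laplace_pdf(5.0) > gaussian_pdf(5.0)
```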
3. Likelihood and Maximum Likelihood Estimation (MLE)
MLE is a fundamental principle for "learning" the parameters of a model from data. It asks: "What model parameters are most likely to have generated the observed data?"
Probability vs. Likelihood
This is a subtle but crucial distinction.
- Probability, P(data | θ): The parameters θ are fixed. We ask about the probability of seeing different data outcomes; the function varies over the data.
- Likelihood, L(θ | data): The data is fixed (it is what we observed). We ask which parameters θ make this observed data most probable; the function varies over θ.
Numerically, L(θ | data) = P(data | θ), but they are conceptually different functions.
The Principle of Maximum Likelihood
Given a dataset D = {x₁, x₂, ..., xₙ} which we assume is i.i.d. (independent and identically distributed) and drawn from a distribution P(x | θ), the likelihood of the entire dataset is the product of the probabilities of each individual data point:
L(θ | D) = P(D | θ) = Π P(xᵢ | θ)
The goal of Maximum Likelihood Estimation (MLE) is to find the parameter values θ_MLE that maximize this likelihood function.
θ_MLE = argmax_θ L(θ | D)
The Log-Likelihood Trick
Products of many probabilities are numerically unstable and difficult to optimize directly. Since the logarithm is a monotonically increasing function, maximizing L(θ) is equivalent to maximizing log(L(θ)). This is a powerful trick.
The log-likelihood, ℓ(θ | D), turns the product into a sum:
ℓ(θ | D) = log(L(θ | D)) = log(Π P(xᵢ | θ)) = Σ log(P(xᵢ | θ))
Sums are much easier to differentiate and work with. So, we find θ_MLE by solving:
θ_MLE = argmax_θ Σ log(P(xᵢ | θ))
This is typically done by taking the derivative of the log-likelihood with respect to θ, setting it to zero, and solving for θ.
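The numerical motivation for the log trick can be seen directly: a raw product of many probabilities underflows to zero in floating point, while the sum of logs stays perfectly usable. A minimal sketch with 1,000 identical points of probability 0.1:

```python
import math

# 1,000 i.i.d. points, each with probability 0.1.
probs = [0.1] * 1000

# The raw likelihood 0.1^1000 = 1e-1000 is far below the smallest
# representable double (~5e-324), so the product underflows to 0.0.
product = 1.0
for p in probs:
    product *= p

# The log-likelihood is just a well-behaved sum: 1000 * log(0.1) ≈ -2302.6.
log_likelihood = sum(math.log(p) for p in probs)

assert product == 0.0
assert abs(log_likelihood - 1000 * math.log(0.1)) < 1e-8
```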
Example: MLE for a Bernoulli Distribution
- Problem: We flip a coin n times and observe n_H heads and n_T tails. What is the MLE for p, the probability of heads?
- Data: D consists of n_H outcomes of x=1 and n_T outcomes of x=0.
- Likelihood: The PMF is P(x | p) = p^x * (1-p)^(1-x). The total likelihood is:
  L(p | D) = Π [p^(xᵢ) * (1-p)^(1-xᵢ)] = p^(n_H) * (1-p)^(n_T)
- Log-Likelihood:
  ℓ(p | D) = log(p^(n_H) * (1-p)^(n_T)) = n_H * log(p) + n_T * log(1-p)
- Optimization: Take the derivative with respect to p and set it to zero:
  dℓ/dp = n_H/p - n_T/(1-p) = 0
- Solution: Solving for p gives:
  p_MLE = n_H / (n_H + n_T)
This is the intuitive result: the best estimate for the probability of heads is the observed frequency of heads.
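The closed-form result can be sanity-checked numerically. In this sketch (with made-up counts n_H = 7, n_T = 3), a brute-force grid search over p recovers the same maximizer as the formula:

```python
import math

def log_likelihood(p, n_heads, n_tails):
    # ℓ(p | D) = n_H log(p) + n_T log(1 - p)
    return n_heads * math.log(p) + n_tails * math.log(1 - p)

n_heads, n_tails = 7, 3
p_mle = n_heads / (n_heads + n_tails)  # closed-form MLE: 0.7

# Grid search over p in (0, 1); it should agree with the closed form.
grid = [i / 1000 for i in range(1, 1000)]
p_best = max(grid, key=lambda p: log_likelihood(p, n_heads, n_tails))

assert abs(p_best - p_mle) < 1e-9  # the grid contains 0.700 exactly
```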
4. Loss Functions: Quantifying Model Error
A loss function (or cost function) L(y, ŷ) measures the penalty for a model's prediction ŷ being different from the true value y. The goal of training a machine learning model is to find parameters that minimize the average loss over the training data.
Probabilistic Connection
Many common loss functions are derived directly from the principle of Maximum Likelihood Estimation under a specific assumption about the probability distribution of the target variable. Minimizing the negative log-likelihood is equivalent to minimizing the loss function.
Squared Error Loss (L2 Loss)
- Definition: L(y, ŷ) = (y - ŷ)²
- Use Case: The standard loss function for regression problems.
- Probabilistic Interpretation:
  - Assume the target variable y is determined by our model's prediction ŷ = f(x) plus some Gaussian noise ε: y = f(x) + ε, where ε ~ N(0, σ²) (noise is normally distributed with mean 0).
  - This is equivalent to saying the probability of y given x is Gaussian: P(y | x; θ) = N(y | f(x; θ), σ²).
  - The PDF is: P(y | x; θ) = (1 / sqrt(2πσ²)) * exp(-(y - f(x; θ))² / (2σ²)).
  - The negative log-likelihood for a single data point (x, y) is:
    -log(P(y | x; θ)) = -log(1 / sqrt(2πσ²)) + (y - f(x; θ))² / (2σ²)
  - To find the parameters θ that maximize the likelihood, we minimize the negative log-likelihood. The first term and the factor 1/(2σ²) are constants with respect to the model's output f(x; θ).
  - Therefore, minimizing this expression is equivalent to minimizing (y - f(x; θ))².
- Conclusion: Minimizing the Squared Error Loss is equivalent to performing Maximum Likelihood Estimation for the model parameters under the assumption of Gaussian noise in the target variable.
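The equivalence can be verified numerically: with σ² = 1, the Gaussian negative log-likelihood is half the squared error plus a constant, so both are minimized by the same predictions. A minimal sketch (the (y, ŷ) values are arbitrary):

```python
import math

def gaussian_nll(y, y_hat, sigma2=1.0):
    # Negative log of N(y | ŷ, σ²)
    return 0.5 * math.log(2 * math.pi * sigma2) + (y - y_hat)**2 / (2 * sigma2)

def squared_error(y, y_hat):
    return (y - y_hat)**2

# For σ² = 1: NLL = constant + SE/2, i.e., the same objective up to an
# affine transformation that does not move the minimizer.
const = 0.5 * math.log(2 * math.pi)
for y, y_hat in [(3.0, 2.5), (1.0, 1.0), (-2.0, 0.5)]:
    assert abs(gaussian_nll(y, y_hat) - (const + 0.5 * squared_error(y, y_hat))) < 1e-12
```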
Logistic Loss (Binary Cross-Entropy)
- Definition: L(y, ŷ) = -[y log(ŷ) + (1-y) log(1-ŷ)], where y is the true label (0 or 1) and ŷ is the model's predicted probability of the class being 1.
- Use Case: The standard loss function for binary classification.
- Probabilistic Interpretation:
  - In binary classification, the target y is a binary outcome (0 or 1). We model this with a Bernoulli distribution.
  - Our model (e.g., logistic regression) outputs a probability ŷ. We interpret this as the parameter p of the Bernoulli distribution: P(y | x; θ) = Bernoulli(y | ŷ).
  - The PMF is: P(y | x; θ) = ŷ^y * (1-ŷ)^(1-y).
  - The log-likelihood for a single data point is: log(P(y | x; θ)) = y log(ŷ) + (1-y) log(1-ŷ)
  - The negative log-likelihood is: -log(P(y | x; θ)) = -[y log(ŷ) + (1-y) log(1-ŷ)]
- Conclusion: The Logistic Loss function is exactly the negative log-likelihood of the data under the assumption that the labels are generated from a Bernoulli distribution whose parameter is the model's output.
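The identity between the logistic loss and the Bernoulli negative log-likelihood can be checked directly (the (y, ŷ) pairs below are arbitrary):

```python
import math

def bce(y, y_hat):
    # Binary cross-entropy (logistic loss) for a single example.
    return -(y * math.log(y_hat) + (1 - y) * math.log(1 - y_hat))

def bernoulli_nll(y, y_hat):
    # Negative log of the Bernoulli PMF ŷ^y * (1-ŷ)^(1-y).
    return -math.log(y_hat**y * (1 - y_hat)**(1 - y))

for y, y_hat in [(1, 0.9), (0, 0.9), (1, 0.2)]:
    assert abs(bce(y, y_hat) - bernoulli_nll(y, y_hat)) < 1e-12
```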
Cross-Entropy Loss (Multi-class)
This is the generalization of logistic loss to K classes.
- Definition: L(y, ŷ) = -Σ [yₖ log(ŷₖ)], where y is the one-hot encoded true label vector (e.g., [0, 1, 0] for class 2) and ŷ is the vector of predicted probabilities from a softmax function.
- Probabilistic Interpretation: This is the negative log-likelihood of the data under the assumption that the labels are generated from a Categorical distribution whose parameter vector p is the model's softmax output vector ŷ.
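With a one-hot label, the sum collapses to the negative log-probability of the true class, as this small sketch (with made-up probabilities) shows:

```python
import math

y = [0.0, 1.0, 0.0]      # one-hot label: the true class is index 1
y_hat = [0.2, 0.7, 0.1]  # softmax output (sums to 1)

# Cross-entropy: -Σ yₖ log(ŷₖ). Only the true-class term survives.
cross_entropy = -sum(yk * math.log(pk) for yk, pk in zip(y, y_hat))

assert abs(cross_entropy - (-math.log(0.7))) < 1e-12
```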
5. Bayesian Interpretation of Machine Learning
The Bayesian approach provides a powerful framework for reasoning about model uncertainty and offers a deep justification for regularization.
Frequentist vs. Bayesian View
- Frequentist (MLE): Model parameters θ are fixed, unknown constants. We find a single best point estimate θ_MLE that maximizes the data likelihood.
- Bayesian: Model parameters θ are random variables. We start with a prior belief about θ and update it using data to get a posterior distribution over θ. The result is not a single best value but a full probability distribution over the parameters.
Bayes' Theorem
This theorem is the engine of Bayesian inference. It tells us how to update our beliefs in light of new evidence.
P(θ | D) = [P(D | θ) * P(θ)] / P(D)
- P(θ | D) - Posterior: The probability distribution of the parameters θ after observing the data D. This is our updated belief.
- P(D | θ) - Likelihood: The same likelihood function used in MLE. It represents the probability of the data given a specific choice of parameters.
- P(θ) - Prior: The probability distribution of the parameters θ before observing any data. This encodes our prior beliefs or assumptions (e.g., that weights should be small).
- P(D) - Evidence (or Marginal Likelihood): The probability of the data, averaged over all possible parameter values: P(D) = ∫ P(D | θ) P(θ) dθ. It acts as a normalization constant.
Maximum A Posteriori (MAP) Estimation
Finding the full posterior distribution P(θ | D) can be computationally expensive. A simpler alternative is to find the single most probable parameter setting, similar to MLE. This is called Maximum A Posteriori (MAP) estimation.
θ_MAP = argmax_θ P(θ | D)
Since the evidence P(D) is constant with respect to θ, we can ignore it for optimization:
θ_MAP = argmax_θ [P(D | θ) * P(θ)]
Taking the log, as we did for MLE:
θ_MAP = argmax_θ [log(P(D | θ)) + log(P(θ))]
θ_MAP = argmax_θ [log-likelihood + log-prior]
This reveals a profound connection: MAP estimation is equivalent to maximizing the log-likelihood plus a term from the log of the prior.
The Crucial Link: MAP and Regularization
Regularization is a technique used to prevent overfitting by adding a penalty term to the loss function that discourages complex models (e.g., large parameter weights). MAP provides a probabilistic justification for this.
Objective Function = Loss Function + Regularization Term
-log(Posterior) = -log(Likelihood) + [-log(Prior)]
The negative log-prior acts as the regularization term!
Derivation: Gaussian Prior leads to L2 Regularization (Ridge)
- Assume a Gaussian prior on the model weights w: Before seeing any data, we believe the weights are likely to be small and centered around zero. We model this with a zero-mean Gaussian prior:
  P(w) ~ N(0, τ²I), where τ² controls the width of the prior (how spread out we believe the weights to be).
- Calculate the log-prior:
  log(P(w)) = log[(1 / sqrt(2πτ²)) * exp(-||w||² / (2τ²))]
  log(P(w)) = C - ||w||² / (2τ²)
  where C is a constant and ||w||² is the squared L2-norm (Σ wᵢ²).
- Formulate the MAP objective:
  w_MAP = argmax_w [Σ log(P(yᵢ | xᵢ, w)) + log(P(w))]
  To turn this into a minimization problem (like minimizing loss), we take the negative:
  w_MAP = argmin_w [-Σ log(P(yᵢ | xᵢ, w)) - log(P(w))]
  w_MAP = argmin_w [Loss_MLE - (C - ||w||² / (2τ²))]
- Simplify: Dropping constants, the objective becomes:
  w_MAP = argmin_w [Loss_MLE + (1 / (2τ²)) * ||w||²]
  w_MAP = argmin_w [Loss_MLE + λ * ||w||²]
  where λ = 1 / (2τ²).
- Conclusion: This is exactly the objective function for L2 Regularization (Ridge Regression). The regularization strength λ is inversely proportional to the variance τ² of the prior: a strong prior belief that weights are near zero (small τ²) gives strong regularization (large λ).
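The derivation can be checked on a toy 1-D regression problem: the minimizer of the negative log-posterior (found by grid search) matches the ridge closed form, and is shrunk toward zero relative to the MLE. The data and τ² below are made up:

```python
# Toy 1-D ridge: y ≈ w·x with Gaussian likelihood (σ² = 1) and Gaussian
# prior w ~ N(0, τ²). The negative log-posterior (constants dropped) is
#   J(w) = Σ (yᵢ - w·xᵢ)² / 2 + w² / (2τ²),
# with closed-form minimizer w_MAP = Σ xᵢyᵢ / (Σ xᵢ² + 1/τ²).
xs = [1.0, 2.0, 3.0, 4.0]
ys = [1.1, 1.9, 3.2, 3.9]
tau2 = 0.5  # small τ² → strong prior → strong shrinkage toward 0

def neg_log_posterior(w):
    return sum((y - w * x)**2 for x, y in zip(xs, ys)) / 2 + w**2 / (2 * tau2)

w_closed = sum(x * y for x, y in zip(xs, ys)) / (sum(x * x for x in xs) + 1 / tau2)

# Numerical check: a fine grid search agrees with the closed form.
grid = [i / 10000 for i in range(-20000, 20001)]
w_grid = min(grid, key=neg_log_posterior)
assert abs(w_grid - w_closed) < 1e-3

# The MAP estimate is shrunk toward zero relative to the MLE (no prior).
w_mle = sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)
assert abs(w_closed) < abs(w_mle)
```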
Intuition: Laplace Prior leads to L1 Regularization (Lasso)
A similar derivation can be done for the Laplace prior.
- Assume a Laplace prior on the weights w: P(w) ~ Laplace(0, b). The PDF involves exp(-|w| / b).
- Calculate the log-prior: The log-prior is proportional to -|w|, i.e., to the negative L1-norm.
- Formulate the MAP objective: Minimizing the negative log-posterior becomes:
  w_MAP = argmin_w [Loss_MLE + λ * ||w||₁]
- Conclusion: This is the objective for L1 Regularization (Lasso Regression). The "pointy" shape of the Laplace distribution's PDF at zero encourages many weights to become exactly zero, leading to sparse models.
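The sparsity claim can be illustrated in one dimension: minimizing a quadratic plus an L1 penalty has a closed-form "soft threshold" solution that snaps small values to exactly zero (the penalty λ = 1 here is an arbitrary choice):

```python
# 1-D illustration of why the L1 penalty produces exact zeros.
# Minimizing (w - z)²/2 + λ|w| has the closed-form soft-threshold solution
#   w* = sign(z) · max(|z| - λ, 0),
# so any |z| ≤ λ is mapped to exactly 0 (unlike the L2 penalty, which
# only shrinks w multiplicatively and never reaches zero).
lam = 1.0

def l1_objective(w, z):
    return (w - z)**2 / 2 + lam * abs(w)

def soft_threshold(z):
    return (1 if z > 0 else -1) * max(abs(z) - lam, 0.0)

# Brute-force check on a fine grid for a few observed values z.
grid = [i / 1000 for i in range(-5000, 5001)]
for z in [0.5, -0.8, 2.5]:
    w_grid = min(grid, key=lambda w: l1_objective(w, z))
    assert abs(w_grid - soft_threshold(z)) < 1e-6

assert soft_threshold(0.5) == 0.0   # small inputs become exactly zero
assert soft_threshold(2.5) == 1.5   # large inputs are shrunk by λ
```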