Unit 5 - Notes

MTH265

Unit 5: Discrete Probability - II

1. Bayes’ Theorem and Generalized Bayes’ Theorem

1.1 Introduction and Foundation

Before delving into Bayes' theorem, it is essential to recall the definition of conditional probability. The probability of event $A$ occurring given that event $B$ has already occurred is denoted $P(A|B)$ and is defined by:
$P(A|B) = \frac{P(A \cap B)}{P(B)}$, assuming $P(B) > 0$.

1.2 Bayes' Theorem (Standard Form)

Bayes’ theorem provides a way to update the probability of a hypothesis given new or additional evidence. It follows from the definition of conditional probability by writing $P(A \cap B) = P(B|A) \cdot P(A)$, and it relates the conditional and marginal probabilities of two events.

Formula:
$P(A|B) = \frac{P(B|A) \cdot P(A)}{P(B)}$

Where:

$P(A|B)$ : Posterior probability (probability of $A$ after observing $B$ ).
$P(B|A)$ : Likelihood (probability of observing $B$ given that $A$ is true).
$P(A)$ : Prior probability (initial probability of $A$ ).
$P(B)$ : Marginal probability (total probability of observing $B$ ).

1.3 The Law of Total Probability

To calculate the denominator $P(B)$ in Bayes' theorem, we often use the Law of Total Probability. Since an event $A$ and its complement $A'$ always partition the sample space:
$P(B) = P(B|A)P(A) + P(B|A')P(A')$
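As a minimal sketch of how the two formulas above combine, the following computes a posterior for a diagnostic test. The function name `bayes_posterior` and the numbers (1% prevalence, 95% sensitivity, 5% false-positive rate) are hypothetical and chosen only for illustration.

```python
from fractions import Fraction


def bayes_posterior(prior, likelihood, false_positive_rate):
    """Return P(A|B) given P(A), P(B|A), and P(B|A')."""
    # Law of total probability: P(B) = P(B|A)P(A) + P(B|A')P(A')
    evidence = likelihood * prior + false_positive_rate * (1 - prior)
    # Bayes' theorem: P(A|B) = P(B|A)P(A) / P(B)
    return likelihood * prior / evidence


# Hypothetical inputs: A = "has the condition", B = "test is positive".
posterior = bayes_posterior(
    prior=Fraction(1, 100),
    likelihood=Fraction(95, 100),
    false_positive_rate=Fraction(5, 100),
)
print(posterior, float(posterior))
```

With these assumed inputs the posterior is $19/118$, roughly 16%, which is far below the 95% sensitivity because the prior is so small.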

1.4 Generalized Bayes’ Theorem

The generalized form expands upon the standard form by allowing the sample space to be partitioned into $n$ mutually exclusive and exhaustive events: $A_1, A_2, \ldots, A_n$.

Formula:
$P(A_i|B) = \frac{P(B|A_i) \cdot P(A_i)}{\sum_{j=1}^{n} P(B|A_j) \cdot P(A_j)}$

Application: This is widely used in machine learning (e.g., Naive Bayes classifiers), medical diagnostics, and spam filtering where multiple underlying conditions could result in the observed evidence.
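The generalized formula can be sketched as a normalization over all hypotheses. The helper `posteriors` and the three-machine defect scenario below are hypothetical, used only to make the sum in the denominator concrete.

```python
from fractions import Fraction


def posteriors(priors, likelihoods):
    """Return [P(A_i|B)] from [P(A_i)] and [P(B|A_i)] over a partition."""
    # Numerators P(B|A_i) * P(A_i), one per hypothesis.
    joint = [p * l for p, l in zip(priors, likelihoods)]
    # Denominator: the sum over j of P(B|A_j) * P(A_j), which equals P(B).
    evidence = sum(joint)
    return [j / evidence for j in joint]


# Hypothetical: machines make 50%, 30%, 20% of items with defect rates 1%, 2%, 3%.
priors = [Fraction(5, 10), Fraction(3, 10), Fraction(2, 10)]
likelihoods = [Fraction(1, 100), Fraction(2, 100), Fraction(3, 100)]
result = posteriors(priors, likelihoods)
print(result)
```

Because every numerator is divided by the same total, the posteriors always sum to 1.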

2. Expected Values

2.1 Definition

The expected value (or expectation/mean) of a discrete random variable is a weighted average of all possible values that the random variable can take on. The weights are the probabilities of the respective values occurring. It represents the "center of mass" of a distribution or the long-run average of the random variable over many independent trials.

Formula:
If $X$ is a discrete random variable whose set of possible values is $S$ and whose probability mass function is $P(X = x)$, the expected value $E(X)$ is:
$E(X) = \sum_{x \in S} x \cdot P(X = x)$

2.2 Example

Consider rolling a fair six-sided die. Let $X$ be the outcome of the roll.

Values $x$ : 1, 2, 3, 4, 5, 6
Probabilities $P(X=x)$ : $1/6$ for each $x$ .
$E(X) = 1(1/6) + 2(1/6) + 3(1/6) + 4(1/6) + 5(1/6) + 6(1/6) = 21/6 = 3.5$
Notice that the expected value (3.5) is not a possible outcome of a single roll, but rather the average over many rolls.
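The die calculation can be reproduced directly from the definition. This is a small sketch; representing the PMF as a dictionary and the name `expected_value` are choices made here, not part of the notes.

```python
from fractions import Fraction


def expected_value(pmf):
    """Weighted average of the values, using their probabilities as weights."""
    return sum(x * p for x, p in pmf.items())


# Fair six-sided die: each face has probability 1/6.
die_pmf = {x: Fraction(1, 6) for x in range(1, 7)}
die_mean = expected_value(die_pmf)
print(die_mean)  # 7/2
```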

3. Linearity of Expectations

3.1 Concept

Linearity of expectations is a fundamental and powerful property of expected values. It states that the expected value of a sum of random variables is equal to the sum of their individual expected values.

Crucial Property: Linearity of expectations holds regardless of whether the random variables are independent or dependent.

3.2 Formulas

For any two random variables $X$ and $Y$ and any real numbers $a$, $b$, and $c$:

Addition: $E(X + Y) = E(X) + E(Y)$
Scalar Multiplication: $E(aX) = a \cdot E(X)$
General Linear Combination: $E(aX + bY + c) = aE(X) + bE(Y) + c$

3.3 Application Example

If you roll two dice, what is the expected sum?
Let $X$ be the first die, $Y$ be the second die. We want $E(X+Y)$ .
By linearity of expectation: $E(X+Y) = E(X) + E(Y)$ .
Since $E(X) = 3.5$ and $E(Y) = 3.5$ , $E(X+Y) = 3.5 + 3.5 = 7$ .
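To illustrate that linearity does not depend on independence, the sketch below enumerates all 36 outcomes of two dice and checks the rule both for the independent pair $(X, Y)$ and for the dependent pair $(X, \max(X, Y))$. The helper `expect` is a hypothetical name introduced here.

```python
from fractions import Fraction
from itertools import product

# All 36 equally likely outcomes (x, y) of rolling two fair dice.
outcomes = list(product(range(1, 7), repeat=2))
p = Fraction(1, len(outcomes))


def expect(f):
    """E[f(X, Y)] computed by summing over every outcome."""
    return sum(f(x, y) * p for x, y in outcomes)


e_x = expect(lambda x, y: x)
e_y = expect(lambda x, y: y)
e_sum = expect(lambda x, y: x + y)

# X and max(X, Y) are dependent, yet the expectations still add.
e_max = expect(lambda x, y: max(x, y))
e_x_plus_max = expect(lambda x, y: x + max(x, y))

print(e_sum == e_x + e_y)
print(e_x_plus_max == e_x + e_max)
```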

4. The Geometric Distribution

4.1 Definition

The geometric distribution models the number of independent Bernoulli trials needed to get the first success.
A Bernoulli trial is an experiment with exactly two possible outcomes: "Success" (with probability $p$ ) and "Failure" (with probability $1-p$ ).

4.2 Probability Mass Function (PMF)

Let $X$ be the random variable representing the number of trials needed to achieve the first success.
$P(X = k) = (1 - p)^{k-1} \cdot p$
Where:

$k \in \{1, 2, 3, ...\}$ (number of trials)
$p$ is the probability of success on a single trial.
$(1-p)^{k-1}$ represents $k-1$ consecutive failures.
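A minimal sketch of the PMF, assuming success means rolling a six on a fair die ($p = 1/6$); the function name `geometric_pmf` is hypothetical.

```python
from fractions import Fraction


def geometric_pmf(k, p):
    """P(X = k): k - 1 failures followed by one success."""
    if k < 1:
        raise ValueError("k must be a positive integer")
    return (1 - p) ** (k - 1) * p


p = Fraction(1, 6)
first_three = [geometric_pmf(k, p) for k in (1, 2, 3)]
print(first_three)

# The probabilities of k = 1..n sum to 1 - (1 - p)^n, which approaches 1.
partial_sum = sum(geometric_pmf(k, p) for k in range(1, 11))
print(partial_sum == 1 - (1 - p) ** 10)
```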

4.3 Key Metrics

Expected Value: $E(X) = \frac{1}{p}$
(e.g., if the chance of flipping heads is $0.5$, you expect to flip the coin $1/0.5 = 2$ times on average to get the first heads.)
Variance: $Var(X) = \frac{1-p}{p^2}$
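The mean and variance can be checked numerically by truncating the infinite sums. This is only a rough sketch: the cutoff `max_k` and the choice $p = 0.25$ are assumptions, and the cutoff must be raised for very small $p$ so that the neglected tail stays negligible.

```python
import math


def geometric_moments(p, max_k=2000):
    """Approximate (mean, variance) of a geometric variable by truncated sums."""
    ks = range(1, max_k + 1)
    mean = sum(k * (1 - p) ** (k - 1) * p for k in ks)
    second_moment = sum(k * k * (1 - p) ** (k - 1) * p for k in ks)
    # Var(X) = E(X^2) - [E(X)]^2
    return mean, second_moment - mean ** 2


p = 0.25
mean, variance = geometric_moments(p)
print(math.isclose(mean, 1 / p, rel_tol=1e-9))
print(math.isclose(variance, (1 - p) / p ** 2, rel_tol=1e-9))
```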

4.4 Memoryless Property

The geometric distribution is the only discrete probability distribution on the positive integers with the memoryless property. Having already seen $m$ failures does not change the distribution of the number of additional trials needed, so for all non-negative integers $m$ and $n$:
$P(X > m + n | X > m) = P(X > n)$
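The identity can be verified exactly using $P(X > n) = (1-p)^n$, which holds because $X > n$ means the first $n$ trials all failed. The function names and the value $p = 1/6$ below are illustrative assumptions.

```python
from fractions import Fraction


def survival(n, p):
    """P(X > n): the first n trials are all failures."""
    return (1 - p) ** n


def conditional_survival(m, n, p):
    """P(X > m + n | X > m) from the definition of conditional probability."""
    # The event {X > m + n} is contained in {X > m}, so the intersection
    # is just {X > m + n}.
    return survival(m + n, p) / survival(m, p)


p = Fraction(1, 6)
checks = [
    conditional_survival(m, n, p) == survival(n, p)
    for m in range(5)
    for n in range(5)
]
print(all(checks))
```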

5. Independent Random Variables

5.1 Definition

Two random variables $X$ and $Y$ are independent if the realization of one does not affect the probability distribution of the other. Mathematically, $X$ and $Y$ are independent if and only if, for all possible values $x$ and $y$ :
$P(X = x \text{ and } Y = y) = P(X = x) \cdot P(Y = y)$

5.2 Independence and Expectations

If $X$ and $Y$ are independent random variables, then the expected value of their product is the product of their expected values:
$E(XY) = E(X) \cdot E(Y)$
(Note: The converse is not necessarily true. If $E(XY) = E(X)E(Y)$ , the variables are uncorrelated, but not guaranteed to be strictly independent).
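A standard counterexample makes the note concrete: take $X$ uniform on $\{-1, 0, 1\}$ and $Y = X^2$. The sketch below, with hypothetical helper names, shows that $E(XY) = E(X)E(Y)$ holds while the product rule for the joint probabilities fails.

```python
from fractions import Fraction

# Joint PMF of (X, Y) with X uniform on {-1, 0, 1} and Y = X**2.
joint = {(x, x * x): Fraction(1, 3) for x in (-1, 0, 1)}


def expect(f):
    """E[f(X, Y)] under the joint PMF."""
    return sum(f(x, y) * p for (x, y), p in joint.items())


def marginal_x(x0):
    return sum(p for (x, _), p in joint.items() if x == x0)


def marginal_y(y0):
    return sum(p for (_, y), p in joint.items() if y == y0)


uncorrelated = expect(lambda x, y: x * y) == (
    expect(lambda x, y: x) * expect(lambda x, y: y)
)
# Independence requires the product rule for every pair of values.
independent = all(
    joint.get((x, y), 0) == marginal_x(x) * marginal_y(y)
    for x in (-1, 0, 1)
    for y in (0, 1)
)
print(uncorrelated, independent)
```

Here $P(X = 0 \text{ and } Y = 0) = 1/3$ but $P(X = 0) \cdot P(Y = 0) = 1/9$, so the variables are dependent even though they are uncorrelated.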

6. Variance

6.1 Definition

While the expected value gives the center of a distribution, the variance measures the spread or dispersion of the random variable around its expected value. It is the expected value of the squared deviation from the mean.

6.2 Formulas

Let $\mu = E(X)$ . The variance $Var(X)$ (often denoted as $\sigma^2$ ) is defined as:
$Var(X) = E[(X - \mu)^2]$

Alternative (Computational) Formula:
Expanding the square in the definition and applying linearity of expectation gives a more computationally convenient formula:
$Var(X) = E(X^2) - [E(X)]^2$
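Both formulas can be checked against each other on the fair die from Section 2.2. This is a small sketch; the helper `expect` and the dictionary representation of the PMF are assumptions made for illustration.

```python
from fractions import Fraction

die_pmf = {x: Fraction(1, 6) for x in range(1, 7)}


def expect(pmf, f=lambda x: x):
    """E[f(X)] for a discrete PMF given as {value: probability}."""
    return sum(f(x) * p for x, p in pmf.items())


mu = expect(die_pmf)
# Definition: expected squared deviation from the mean.
var_definition = expect(die_pmf, lambda x: (x - mu) ** 2)
# Computational formula: E(X^2) - [E(X)]^2.
var_shortcut = expect(die_pmf, lambda x: x * x) - mu ** 2
print(var_definition, var_shortcut)
```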

6.3 Standard Deviation

The standard deviation $\sigma$ is the square root of the variance. It is often preferred because it is in the same units as the random variable $X$ .
$\sigma_X = \sqrt{Var(X)}$

6.4 Properties of Variance

Unlike expectation, variance is not a linear operator.

$Var(X + c) = Var(X)$ (Adding a constant shifts the distribution but does not change the spread).
$Var(aX) = a^2 \cdot Var(X)$ (Multiplying a random variable by a constant scales the variance by the square of that constant).
$Var(aX + b) = a^2 \cdot Var(X)$
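The properties above can be confirmed on a concrete distribution. The sketch below transforms the PMF of a fair die; the names `variance` and `affine` and the constants 3 and 5 are hypothetical choices.

```python
from fractions import Fraction


def variance(pmf):
    """Var(X) for a discrete PMF given as {value: probability}."""
    mean = sum(x * p for x, p in pmf.items())
    return sum((x - mean) ** 2 * p for x, p in pmf.items())


def affine(pmf, a, b):
    """PMF of aX + b (assumes a != 0 so distinct values stay distinct)."""
    return {a * x + b: p for x, p in pmf.items()}


die_pmf = {x: Fraction(1, 6) for x in range(1, 7)}
base = variance(die_pmf)

print(variance(affine(die_pmf, 1, 5)) == base)      # shifting: no change
print(variance(affine(die_pmf, 3, 0)) == 9 * base)  # scaling: factor a^2
print(variance(affine(die_pmf, 3, 5)) == 9 * base)  # both combined
```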

7. Bienaymé’s Formula

7.1 Concept

Bienaymé’s formula gives the variance of a sum of uncorrelated random variables. While the expectation of a sum is always the sum of the expectations, the variance of a sum is guaranteed to equal the sum of the variances when the variables are pairwise uncorrelated, which holds in particular when they are pairwise independent.

7.2 The Formula

If $X_1, X_2, ..., X_n$ are pairwise independent random variables, then the variance of their sum equals the sum of their variances:
$Var(X_1 + X_2 + ... + X_n) = Var(X_1) + Var(X_2) + ... + Var(X_n)$

Alternatively written as:
$Var\left(\sum_{i=1}^{n} X_i\right) = \sum_{i=1}^{n} Var(X_i)$

7.3 Why Independence Matters

If the random variables are not independent, the variance of their sum must account for their covariance. For two variables $X$ and $Y$ :
$Var(X + Y) = Var(X) + Var(Y) + 2 \cdot Cov(X, Y)$
If $X$ and $Y$ are independent, their covariance $Cov(X, Y) = 0$ , which reduces the equation to Bienaymé’s formula for $n=2$ .
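As a sketch of both cases, the code below enumerates two fair dice and compares an independent pair $(X, Y)$ with a dependent pair $(X, \max(X, Y))$. All helper names are hypothetical, and covariance is computed as $E(UV) - E(U)E(V)$.

```python
from fractions import Fraction
from itertools import product

outcomes = list(product(range(1, 7), repeat=2))
p = Fraction(1, len(outcomes))


def expect(f):
    return sum(f(x, y) * p for x, y in outcomes)


def var(f):
    mu = expect(f)
    return expect(lambda x, y: (f(x, y) - mu) ** 2)


def cov(f, g):
    return expect(lambda x, y: f(x, y) * g(x, y)) - expect(f) * expect(g)


def first(x, y):
    return x


def second(x, y):
    return y


def largest(x, y):
    return max(x, y)


# Independent dice: covariance is zero, so the variances simply add.
var_sum = var(lambda x, y: x + y)
print(cov(first, second), var_sum == var(first) + var(second))

# Dependent pair: the covariance term is needed.
var_dep = var(lambda x, y: x + max(x, y))
print(var_dep == var(first) + var(largest) + 2 * cov(first, largest))
```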
