Probability Foundations
Lesson, slides, and applied problem sets.
Why this module exists
Machine learning is fundamentally about uncertainty. We never have perfect data or perfect models. Probability gives us the language to reason about uncertainty, make predictions with confidence intervals, and build models that learn from data.
Understanding probability deeply will transform how you think about classification, generative models, and model uncertainty.
1) What is probability?
Probability measures the likelihood of events. For an event A:
0 ≤ P(A) ≤ 1
P(impossible) = 0
P(certain) = 1
Two interpretations:
- Frequentist: Long-run frequency of occurrence
- Bayesian: Degree of belief, updated with evidence
2) Sample space and events
The sample space Ω is the set of all possible outcomes.
An event is a subset of the sample space.
Example: Rolling a die
- Sample space: {1, 2, 3, 4, 5, 6}
- Event "even number": {2, 4, 6}
- P(even) = 3/6 = 0.5
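A minimal sketch of this die example in Python (standard library only), treating the event as a subset of the sample space and computing its probability as the fraction of equally likely outcomes it contains:

```python
from fractions import Fraction

# Sample space for one roll of a fair six-sided die
sample_space = {1, 2, 3, 4, 5, 6}

# The event "even number" is a subset of the sample space
even = {outcome for outcome in sample_space if outcome % 2 == 0}

# With equally likely outcomes, P(event) = |event| / |sample space|
p_even = Fraction(len(even), len(sample_space))
print(p_even)         # 1/2
print(float(p_even))  # 0.5
```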
3) Probability axioms
- Non-negativity: P(A) ≥ 0
- Normalization: P(Ω) = 1 (something must happen)
- Additivity: For mutually exclusive events A, B: P(A or B) = P(A) + P(B)
From these, everything else follows.
4) Conditional probability
The probability of A given B has occurred:
P(A|B) = P(A and B) / P(B)
Conditioning restricts the sample space to only outcomes where B is true.
Example:
- P(rain tomorrow | cloudy today) is different from P(rain tomorrow)
- New information updates our belief
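To make the definition concrete, here is a small sketch that estimates P(A|B) from a table of day records by restricting to the rows where B is true. The counts are made up for illustration, not real weather data:

```python
# Each record: (cloudy_today, rain_tomorrow) -- illustrative counts only
days = [(True, True)] * 30 + [(True, False)] * 20 + \
       [(False, True)] * 10 + [(False, False)] * 40

def prob(event, records):
    """Empirical probability: fraction of records where `event` holds."""
    return sum(event(r) for r in records) / len(records)

cloudy = lambda r: r[0]
rain = lambda r: r[1]

# P(rain tomorrow), using all days
p_rain = prob(rain, days)

# P(rain tomorrow | cloudy today): restrict the sample space to cloudy days
cloudy_days = [r for r in days if cloudy(r)]
p_rain_given_cloudy = prob(rain, cloudy_days)

print(p_rain)               # 0.4
print(p_rain_given_cloudy)  # 0.6 -- conditioning changed the probability
```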
5) Bayes' theorem
The most important formula in probabilistic ML:
P(A|B) = P(B|A) * P(A) / P(B)
Or in ML terms:
P(hypothesis|data) = P(data|hypothesis) * P(hypothesis) / P(data)
posterior = likelihood * prior / evidence
Bayes' theorem lets us update beliefs as we see more data.
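A small sketch of one Bayes update with illustrative numbers (a test with 90% sensitivity, a 5% false-positive rate, and a 1% prior; none of these values come from a real test):

```python
# Illustrative numbers only
prior = 0.01                # P(hypothesis): base rate of the condition
likelihood = 0.90           # P(data | hypothesis): positive test given condition
false_positive_rate = 0.05  # P(data | not hypothesis)

# Evidence P(data) via the law of total probability
evidence = likelihood * prior + false_positive_rate * (1 - prior)

# posterior = likelihood * prior / evidence
posterior = likelihood * prior / evidence
print(round(posterior, 3))  # ~0.154: one positive test raises belief from 1% to ~15%
```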
6) Independence
Events A and B are independent if knowing one tells you nothing about the other:
P(A and B) = P(A) * P(B)
P(A|B) = P(A)
Example: Two fair coin flips are independent.
Conditional independence: A and B are independent given C:
P(A and B | C) = P(A|C) * P(B|C)
Many ML models assume conditional independence (e.g., Naive Bayes).
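A quick sketch that checks the product rule empirically for two simulated fair coin flips, using Python's standard random module; with a finite sample the two numbers will be close but not identical:

```python
import random

random.seed(0)
n = 100_000

# Simulate two independent fair coin flips per trial (True = heads)
flips = [(random.random() < 0.5, random.random() < 0.5) for _ in range(n)]

p_a = sum(a for a, _ in flips) / n          # P(first flip is heads)
p_b = sum(b for _, b in flips) / n          # P(second flip is heads)
p_ab = sum(a and b for a, b in flips) / n   # P(both heads)

# For independent events, P(A and B) should be close to P(A) * P(B)
print(p_ab, p_a * p_b)  # both near 0.25
```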
7) Random variables
A random variable is a function that assigns numbers to outcomes.
- Discrete: Finite or countable values (die roll, coin flip, word count)
- Continuous: Any value in a range (height, temperature, probability)
8) Probability distributions
A distribution describes how probability is spread across values.
Discrete distributions
Bernoulli: Single binary trial (coin flip)
P(X=1) = p, P(X=0) = 1-p
Binomial: Number of successes in n trials
P(X=k) = C(n,k) * p^k * (1-p)^(n-k)
Categorical: One of K classes (generalized coin)
Continuous distributions
Uniform: Equal probability density across the range [a, b]
Gaussian (Normal): The famous bell curve
p(x) = (1/√(2πσ²)) * exp(-(x-μ)²/(2σ²))
Parameterized by mean μ and standard deviation σ.
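A sketch of drawing samples from the distributions above, assuming numpy is installed; the parameter values are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(0)

# Bernoulli(p=0.3): a single biased coin flip (binomial with n=1)
bernoulli = rng.binomial(n=1, p=0.3, size=10_000)

# Binomial(n=10, p=0.3): number of successes in 10 trials
binomial = rng.binomial(n=10, p=0.3, size=10_000)

# Uniform on [a, b] = [-1, 1]
uniform = rng.uniform(low=-1.0, high=1.0, size=10_000)

# Gaussian with mean mu=2 and standard deviation sigma=0.5
gaussian = rng.normal(loc=2.0, scale=0.5, size=10_000)

# Empirical means should sit near the theoretical ones: 0.3, 3.0, 0.0, 2.0
print(bernoulli.mean(), binomial.mean(), uniform.mean(), gaussian.mean())
```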
9) Expected value and variance
Expected value (mean): The weighted average outcome
E[X] = Σ x * P(X=x) (discrete)
E[X] = ∫ x * p(x) dx (continuous)
Variance: How spread out the distribution is
Var[X] = E[(X - E[X])²] = E[X²] - (E[X])²
Standard deviation: √Variance (same units as X)
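A tiny sketch computing E[X] and Var[X] directly from these definitions for a fair six-sided die:

```python
# Fair six-sided die: each value has probability 1/6
values = [1, 2, 3, 4, 5, 6]
probs = [1/6] * 6

# E[X] = sum over x of x * P(X=x)
mean = sum(x * p for x, p in zip(values, probs))

# Var[X] = E[X^2] - (E[X])^2
mean_sq = sum(x**2 * p for x, p in zip(values, probs))
variance = mean_sq - mean**2

print(mean)             # 3.5
print(variance)         # ~2.9167
print(variance ** 0.5)  # standard deviation, same units as X
```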
10) Joint and marginal distributions
Joint distribution: Probability of multiple variables together
P(X=x, Y=y) = probability that X=x AND Y=y
Marginal distribution: Collapse over one variable
P(X=x) = Σ_y P(X=x, Y=y)
Chain rule:
P(X, Y) = P(X|Y) * P(Y) = P(Y|X) * P(X)
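A short sketch with a made-up 2x2 joint table for two binary variables, marginalizing by summing over the other variable and checking the chain rule (assuming numpy):

```python
import numpy as np

# Made-up joint distribution P(X, Y): rows index X, columns index Y.
# Entries are non-negative and sum to 1.
joint = np.array([[0.3, 0.1],
                  [0.2, 0.4]])

# Marginals: collapse (sum) over the other variable
p_x = joint.sum(axis=1)  # P(X=x) = sum_y P(X=x, Y=y) -> [0.4, 0.6]
p_y = joint.sum(axis=0)  # P(Y=y) = sum_x P(X=x, Y=y) -> [0.5, 0.5]

# Chain rule check: P(X, Y) = P(X|Y) * P(Y)
p_x_given_y = joint / p_y          # divide each column by P(Y=y)
reconstructed = p_x_given_y * p_y  # multiplying back restores the joint
print(np.allclose(reconstructed, joint))  # True
```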
11) Maximum Likelihood Estimation (MLE)
Given data, what parameters make the data most probable?
θ_MLE = argmax_θ P(data | θ)
Usually we maximize log-likelihood (easier math):
θ_MLE = argmax_θ Σ log P(x_i | θ)
Example: For coin flips, MLE of p = (# heads) / (# flips)
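A sketch confirming the coin-flip result numerically: scan candidate values of p, pick the one that maximizes the log-likelihood, and compare against the closed-form (# heads) / (# flips). The data and grid are illustrative, and numpy is assumed:

```python
import numpy as np

# Observed flips: 1 = heads, 0 = tails (illustrative data)
flips = np.array([1, 0, 1, 1, 0, 1, 1, 0, 1, 1])

# Closed-form MLE for a Bernoulli parameter: fraction of heads
p_closed_form = flips.mean()

# Numerical MLE: maximize the log-likelihood over a grid of candidate p
candidates = np.linspace(0.001, 0.999, 999)
log_lik = (flips.sum() * np.log(candidates)
           + (len(flips) - flips.sum()) * np.log(1 - candidates))
p_grid = candidates[np.argmax(log_lik)]

print(p_closed_form, p_grid)  # both ~0.7
```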
12) Probability in classification
In classification, we want P(class | features):
Using Bayes:
P(class | x) = P(x | class) * P(class) / P(x)
We often just need to compare classes, so P(x) cancels:
predicted_class = argmax_class P(x | class) * P(class)
This is the foundation of Naive Bayes classifiers.
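A minimal, hand-rolled sketch of this argmax rule for one binary feature and two classes. The priors and likelihoods are illustrative numbers, not learned from data, and this is not a full Naive Bayes implementation:

```python
# Illustrative class priors and per-class feature likelihoods
priors = {"spam": 0.4, "ham": 0.6}  # P(class)
likelihood = {                      # P(x = "message contains 'free'" | class)
    ("spam", True): 0.7, ("spam", False): 0.3,
    ("ham", True): 0.1, ("ham", False): 0.9,
}

def predict(contains_free: bool) -> str:
    # argmax over classes of P(x | class) * P(class); P(x) is ignored
    # because it is the same for every class
    scores = {c: likelihood[(c, contains_free)] * priors[c] for c in priors}
    return max(scores, key=scores.get)

print(predict(True))   # "spam": 0.7*0.4 = 0.28 beats 0.1*0.6 = 0.06
print(predict(False))  # "ham":  0.9*0.6 = 0.54 beats 0.3*0.4 = 0.12
```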
Key takeaways
- Probability quantifies uncertainty; essential for ML
- Conditional probability P(A|B) updates beliefs with evidence
- Bayes' theorem: posterior ∝ likelihood × prior
- Independence simplifies computations (Naive Bayes assumption)
- Distributions describe how probability spreads (Gaussian is everywhere)
- MLE finds parameters that maximize data probability
- Classification is about P(class | features)