Probability Foundations

Lesson, slides, and applied problem sets.

Lesson

Why this module exists

Machine learning is fundamentally about uncertainty. We never have perfect data or perfect models. Probability gives us the language to reason about uncertainty, make predictions with confidence intervals, and build models that learn from data.

Understanding probability deeply will transform how you think about classification, generative models, and model uncertainty.


1) What is probability?

Probability measures the likelihood of events. For an event A:

0 ≤ P(A) ≤ 1
P(impossible) = 0
P(certain) = 1

Two interpretations:

  • Frequentist: Long-run frequency of occurrence
  • Bayesian: Degree of belief, updated with evidence

2) Sample space and events

The sample space Ω is the set of all possible outcomes.

An event is a subset of the sample space.

Example: Rolling a die

  • Sample space: {1, 2, 3, 4, 5, 6}
  • Event "even number": {2, 4, 6}
  • P(even) = 3/6 = 0.5
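
A quick sanity check of the die example, counting outcomes directly (a minimal sketch; the fractions module is only used to keep the result exact):

from fractions import Fraction

sample_space = {1, 2, 3, 4, 5, 6}
event_even = {2, 4, 6}

p_even = Fraction(len(event_even), len(sample_space))
print(p_even)        # 1/2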

3) Probability axioms

  1. Non-negativity: P(A) ≥ 0
  2. Normalization: P(Ω) = 1 (something must happen)
  3. Additivity: For mutually exclusive events A, B: P(A or B) = P(A) + P(B)

From these, everything else follows.
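
For example, the complement rule follows immediately: A and "not A" are mutually exclusive and together make up Ω, so by axioms 2 and 3

P(A) + P(not A) = P(Ω) = 1,  which gives  P(not A) = 1 - P(A)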


4) Conditional probability

The probability of A given that B has occurred (defined when P(B) > 0):

P(A|B) = P(A and B) / P(B)

Conditioning restricts the sample space to only outcomes where B is true.

Example:

  • P(rain tomorrow | cloudy today) is different from P(rain tomorrow)
  • Observed information updates our belief
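
As a concrete check of the definition, here is a minimal sketch with made-up counts for 100 past days (the numbers are illustrative only, not real weather data):

# made-up counts over 100 past days, keyed by (today's sky, tomorrow's weather)
counts = {("cloudy", "rain"): 30, ("cloudy", "dry"): 20,
          ("clear", "rain"): 10, ("clear", "dry"): 40}
total = sum(counts.values())                                    # 100

p_rain = sum(n for (today, tmrw), n in counts.items() if tmrw == "rain") / total
p_cloudy = sum(n for (today, tmrw), n in counts.items() if today == "cloudy") / total
p_rain_and_cloudy = counts[("cloudy", "rain")] / total

p_rain_given_cloudy = p_rain_and_cloudy / p_cloudy
print(p_rain, p_rain_given_cloudy)                              # 0.4 vs 0.6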

5) Bayes' theorem

The most important formula in probabilistic ML:

P(A|B) = P(B|A) * P(A) / P(B)

Or in ML terms:

P(hypothesis|data) = P(data|hypothesis) * P(hypothesis) / P(data)
     posterior     =     likelihood     *     prior     / evidence

Bayes lets us update beliefs as we see more data.
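
A small numeric sketch of the update (the spam prior and likelihoods below are invented for illustration): suppose 1% of emails are spam, and the word "free" appears in 60% of spam but only 5% of non-spam emails.

p_spam = 0.01                     # prior P(hypothesis)
p_free_given_spam = 0.60          # likelihood P(data | hypothesis)
p_free_given_ham = 0.05

# evidence P(data), via the law of total probability
p_free = p_free_given_spam * p_spam + p_free_given_ham * (1 - p_spam)

p_spam_given_free = p_free_given_spam * p_spam / p_free
print(round(p_spam_given_free, 3))   # 0.108 -- one word moves the belief from 1% to ~11%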


6) Independence

Events A and B are independent if knowing one tells you nothing about the other:

P(A and B) = P(A) * P(B)
P(A|B) = P(A)

Example: Two fair coin flips are independent.

Conditional independence: A and B are independent given C:

P(A and B | C) = P(A|C) * P(B|C)

Many ML models assume conditional independence (e.g., Naive Bayes assumes features are independent given the class).
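
A minimal check of the product rule for two fair coin flips, enumerating the four equally likely outcomes:

from itertools import product

outcomes = list(product("HT", repeat=2))        # ('H','H'), ('H','T'), ('T','H'), ('T','T')
p = 1 / len(outcomes)                           # each outcome has probability 1/4

p_first_heads = sum(p for o in outcomes if o[0] == "H")
p_second_heads = sum(p for o in outcomes if o[1] == "H")
p_both_heads = sum(p for o in outcomes if o == ("H", "H"))

print(p_both_heads == p_first_heads * p_second_heads)   # True: 1/4 = 1/2 * 1/2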


7) Random variables

A random variable is a function that assigns a number to each outcome in the sample space.

Discrete: Finite or countable values (die roll, coin flip, word count)

Continuous: Any value in a range (height, temperature, probability)
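
A tiny sketch of the idea: the random variable X maps each two-flip outcome to the number of heads it contains.

from itertools import product
from collections import Counter

def X(outcome):                        # the random variable: count the heads
    return outcome.count("H")

counts = Counter(X(o) for o in product("HT", repeat=2))
pmf = {value: count / 4 for value, count in counts.items()}   # each outcome has prob 1/4
print(pmf)                             # {2: 0.25, 1: 0.5, 0: 0.25}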


8) Probability distributions

A distribution describes how probability is spread across values.

Discrete distributions

Bernoulli: Single binary trial (coin flip)

P(X=1) = p, P(X=0) = 1-p

Binomial: Number of successes in n trials

P(X=k) = C(n,k) * p^k * (1-p)^(n-k)

Categorical: One of K classes (generalized coin)

Continuous distributions

Uniform: Constant probability density over the range [a, b]

Gaussian (Normal): The famous bell curve

p(x) = (1/√(2πσ²)) * exp(-(x-μ)²/(2σ²))

Parameterized by mean μ and standard deviation σ.
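
The Binomial and Gaussian formulas above, typed out directly (a minimal sketch; the parameter values are arbitrary):

import math

def binomial_pmf(k, n, p):
    return math.comb(n, k) * p**k * (1 - p)**(n - k)

def gaussian_pdf(x, mu, sigma):
    return math.exp(-(x - mu)**2 / (2 * sigma**2)) / math.sqrt(2 * math.pi * sigma**2)

print(binomial_pmf(k=3, n=10, p=0.5))         # P(3 heads in 10 fair flips) ~= 0.117
print(gaussian_pdf(x=0.0, mu=0.0, sigma=1.0)) # peak of the standard normal ~= 0.399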


9) Expected value and variance

Expected value (mean): The weighted average outcome

E[X] = Σ x * P(X=x)          (discrete)
E[X] = ∫ x * p(x) dx         (continuous)

Variance: How spread out the distribution is

Var[X] = E[(X - E[X])²] = E[X²] - (E[X])²

Standard deviation: √Variance (same units as X)
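
Applying the discrete formulas to a fair die (a small sketch, no libraries needed):

values = [1, 2, 3, 4, 5, 6]
p = 1 / 6

mean = sum(x * p for x in values)                    # E[X] = 3.5
variance = sum(x**2 * p for x in values) - mean**2   # E[X^2] - (E[X])^2 ~= 2.917
std = variance ** 0.5                                # same units as X, ~= 1.708

print(mean, variance, std)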


10) Joint and marginal distributions

Joint distribution: Probability of multiple variables together

P(X=x, Y=y) = probability that X=x AND Y=y

Marginal distribution: Sum (or integrate) out the other variable

P(X=x) = Σ_y P(X=x, Y=y)

Chain rule:

P(X, Y) = P(X|Y) * P(Y) = P(Y|X) * P(X)
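
A minimal sketch with a made-up 2x2 joint distribution over X (today's sky) and Y (tomorrow's weather):

joint = {("cloudy", "rain"): 0.3, ("cloudy", "dry"): 0.2,
         ("clear", "rain"): 0.1, ("clear", "dry"): 0.4}

# marginals: sum out the other variable
p_x = {}
p_y = {}
for (x, y), p in joint.items():
    p_x[x] = p_x.get(x, 0) + p
    p_y[y] = p_y.get(y, 0) + p

# chain rule check: P(X, Y) = P(Y | X) * P(X)
x, y = "cloudy", "rain"
p_y_given_x = joint[(x, y)] / p_x[x]
print(joint[(x, y)], p_y_given_x * p_x[x])    # both print 0.3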

11) Maximum Likelihood Estimation (MLE)

Given data, what parameters make the data most probable?

θ_MLE = argmax_θ P(data | θ)

Usually we maximize the log-likelihood instead (it turns products into sums and is more numerically stable):

θ_MLE = argmax_θ Σ log P(x_i | θ)

Example: For coin flips, MLE of p = (# heads) / (# flips)
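
A short sketch checking the closed-form answer against a brute-force search over candidate values of p (the flip data below is made up):

import math

flips = [1, 1, 0, 1, 0, 1, 1, 0, 1, 1]        # 1 = heads; 7 heads in 10 flips
heads, n = sum(flips), len(flips)

p_closed_form = heads / n                      # MLE: 0.7

def log_likelihood(p):
    return heads * math.log(p) + (n - heads) * math.log(1 - p)

grid = [i / 1000 for i in range(1, 1000)]      # candidate p values in (0, 1)
p_grid = max(grid, key=log_likelihood)

print(p_closed_form, p_grid)                   # 0.7 0.7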


12) Probability in classification

In classification, we want P(class | features):

Using Bayes:

P(class | x) = P(x | class) * P(class) / P(x)

We often only need to compare classes, and P(x) is the same for every class, so it drops out of the comparison:

predicted_class = argmax_class P(x | class) * P(class)

This is the foundation of Naive Bayes classifiers.
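
A toy sketch of the argmax rule with invented priors and single-feature likelihoods (not a full Naive Bayes implementation):

priors = {"spam": 0.3, "ham": 0.7}
# P(x = "contains 'free'" | class), assumed for illustration
likelihoods = {"spam": 0.6, "ham": 0.05}

scores = {c: likelihoods[c] * priors[c] for c in priors}   # P(x | c) * P(c)
predicted_class = max(scores, key=scores.get)

print(scores)             # roughly {'spam': 0.18, 'ham': 0.035}
print(predicted_class)    # spam -- P(x) was never needed for the comparison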


Key takeaways

  • Probability quantifies uncertainty; essential for ML
  • Conditional probability P(A|B) updates beliefs with evidence
  • Bayes' theorem: posterior ∝ likelihood × prior
  • Independence simplifies computations (Naive Bayes assumption)
  • Distributions describe how probability spreads (Gaussian is everywhere)
  • MLE finds parameters that maximize data probability
  • Classification is about P(class | features)
