Compression Foundations & Entropy

Lesson, slides, and applied problem sets.

Lesson

Why this module exists

Compression is not magic. It is modeling + coding. If your model predicts the data well, codes get short; if it does not, you pay in bits. This module builds the quantitative intuition that every later algorithm relies on: entropy, expected code length, and the constraints that make prefix codes decodable.

We will treat compression as a trade between uncertainty (entropy), model choice, and coding rules (prefix-free). The goal is to make every line of code you write later explainable in terms of these fundamentals.


1) Information content and entropy

For a symbol x that occurs with probability p(x), its information content (surprisal) is:

I(x) = -log2(p(x))

A distribution's entropy is the expected information:

H = -sum_i p_i log2(p_i)

Intuition:

  • A certain event (p=1) carries 0 bits.
  • Rare events carry more bits.
  • Entropy is the minimum average number of bits per symbol that any code can achieve, assuming the code is designed with full knowledge of the distribution.
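
A minimal sketch of this calculation in Python (the function name entropy_bits is ours; it treats the input as i.i.d. byte symbols):

    import math
    from collections import Counter

    def entropy_bits(data: bytes) -> float:
        """Empirical entropy in bits per symbol, assuming i.i.d. byte symbols."""
        counts = Counter(data)
        n = len(data)
        return -sum((c / n) * math.log2(c / n) for c in counts.values())

    print(entropy_bits(b"aaaaaaab"))   # skewed source: ~0.544 bits/symbol
    print(entropy_bits(b"abcdefgh"))   # uniform over 8 symbols: 3.0 bits/symbol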

2) Expected code length and the Shannon bound

If you assign a code length l_i to each symbol, the expected length is:

L = sum_i p_i * l_i

Shannon's source coding theorem says:

H <= L < H + 1

for an optimal prefix code when the lengths l_i must be integers. This is the central performance target: approach entropy from above; no code can average fewer bits than H.
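
A short sketch of the bound using Shannon code lengths l_i = ceil(-log2 p_i) (the helper names are ours):

    import math

    def entropy(probs):
        return -sum(p * math.log2(p) for p in probs if p > 0)

    def expected_length(probs, lengths):
        return sum(p * l for p, l in zip(probs, lengths))

    probs   = [0.5, 0.25, 0.125, 0.125]                      # dyadic probabilities
    lengths = [math.ceil(-math.log2(p)) for p in probs]      # [1, 2, 3, 3]
    print(entropy(probs), expected_length(probs, lengths))   # 1.75, 1.75 (L == H here)

    probs   = [0.4, 0.3, 0.3]                                # non-dyadic: integer lengths cost extra
    lengths = [math.ceil(-math.log2(p)) for p in probs]      # [2, 2, 2]
    print(entropy(probs), expected_length(probs, lengths))   # ~1.571, 2.0 (still < H + 1)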


3) Prefix-free codes and Kraft's inequality

A code is prefix-free if no codeword is a prefix of another. This guarantees instantaneous decoding.

Code lengths l_i can be realized by some binary prefix-free code iff they satisfy Kraft's inequality:

sum_i 2^{-l_i} <= 1

This is a purely combinatorial constraint. It tells you which length assignments are even possible before you build any tree.
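
A Kraft checker is only a few lines (names are ours; Fraction keeps the sum exact for large lengths):

    from fractions import Fraction

    def kraft_sum(lengths):
        """sum_i 2^{-l_i}, computed exactly."""
        return sum(Fraction(1, 2 ** l) for l in lengths)

    def kraft_ok(lengths):
        return kraft_sum(lengths) <= 1

    print(kraft_ok([1, 2, 3, 3]))   # True:  1/2 + 1/4 + 1/8 + 1/8 = 1
    print(kraft_ok([1, 1, 2]))      # False: 1/2 + 1/2 + 1/4 > 1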


4) Redundancy and compression ratio

  • Redundancy = L - H (average extra bits beyond entropy).
  • Compression ratio = compressed_size / original_size (with this convention, smaller is better).

You can have low redundancy but still a poor compression ratio if your model is wrong. The model chooses probabilities; the code only realizes them.
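
A tiny numeric sketch (all numbers are invented) to keep the two metrics apart:

    # Suppose the model assigns near-uniform probabilities to skewed text.
    H_model = 7.90     # entropy of the model's distribution (bits/symbol)
    L       = 7.95     # average bits/symbol the code actually spends
    original_bits = 8  # the uncompressed input stores each symbol in 8 bits

    redundancy = L - H_model      # 0.05 bits/symbol: the code is nearly optimal for its model
    ratio = L / original_bits     # ~0.99: barely any compression, because the model is poor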


5) Modeling and mismatch

Real data is not i.i.d. A good model captures structure:

  • Local repetition (LZ)
  • Skewed symbol frequencies (Huffman)
  • Context dependencies (adaptive / arithmetic)

If the model assigns probabilities q_i while the true probabilities are p_i, cross-entropy measures what you actually pay:

H(p, q) = -sum_i p_i log2(q_i)

This is the expected code length when you code with lengths -log2(q_i); the gap H(p, q) - H(p) (the Kullback-Leibler divergence) is the price of the mismatch.
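
A sketch of the mismatch cost (the function name is ours):

    import math

    def cross_entropy(p, q):
        """Expected bits/symbol when the data follows p but code lengths come from q."""
        return -sum(pi * math.log2(qi) for pi, qi in zip(p, q) if pi > 0)

    p = [0.7, 0.2, 0.1]            # true distribution
    q = [1/3, 1/3, 1/3]            # mismatched uniform model
    print(cross_entropy(p, p))     # entropy of p: ~1.157 bits/symbol
    print(cross_entropy(p, q))     # under the wrong model: ~1.585 bits/symbol
    # The gap, ~0.428 bits/symbol, is the KL divergence D(p || q).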


6) Greedy pitfalls and global optimality

Many compressors make local greedy choices (e.g., longest match in LZ77). Greedy choices can be optimal for a fixed tokenization rule, but not necessarily for global compression ratio once coding costs are considered.

This is a recurring theme: local choices are only good if they align with the global code cost. We will see this in LZ77 (match choice), Huffman (tree shape), and range coding (probability modeling).
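
To make the LZ77 point concrete, here is a toy comparison under an invented cost model (Elias-gamma-style bit counts for offset and length, a flat literal cost); real coders price tokens differently, but the shape of the argument is the same:

    import math

    def gamma_bits(n):
        """Bits for an Elias-gamma-style code of a positive integer (toy cost model)."""
        return 2 * math.floor(math.log2(n)) + 1

    def match_cost(offset, length):
        return gamma_bits(offset) + gamma_bits(length)

    LITERAL_COST = 9  # hypothetical flat cost of one literal

    # Greedy takes the longest match: 6 symbols, but from a distant offset.
    plan_greedy = match_cost(60000, 6)                # ~36 bits
    # Alternative: a 5-symbol match nearby, plus one literal for the leftover symbol.
    plan_alt    = match_cost(8, 5) + LITERAL_COST     # ~21 bits
    print(plan_greedy, plan_alt)                      # the shorter match wins on total cost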


7) What you will build

  1. Entropy estimator: compute entropy and expected code length.
  2. Kraft checker: validate prefix feasibility.
  3. Later modules: build real coders and compare to the Shannon bound.

Key takeaways

  • Entropy is the theoretical floor; codes can approach but not beat it.
  • Prefix-free codes are constrained by Kraft's inequality.
  • Good compression is mostly good modeling.
  • Greedy decisions must align with coding cost.

Next, we make these formulas concrete in code and tests.

