Huffman Coding
Lesson, slides, and applied problem sets.
Why this module exists
Huffman coding is the canonical example of optimal prefix codes when symbol probabilities are known. It is also the point where theory meets real implementation details: tie-breaking, canonical forms, and deterministic decoding tables.
1) The Huffman idea
Given symbol frequencies f_i, let p_i = f_i / sum_j f_j. Build a prefix-free code that minimizes the expected codeword length
L = sum_i p_i * l_i,
where l_i is the length of the codeword for symbol i.
Huffman does this by repeatedly merging the two least probable nodes into one. This greedy step is optimal and produces a full binary tree (every internal node has exactly two children).
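As a concrete check of the objective, here is a small hypothetical distribution (the probabilities, symbols, and code are illustrative, not from the lesson) with an optimal prefix code for it:

```python
# Hypothetical three-symbol distribution and an optimal prefix code.
p = {"a": 0.5, "b": 0.25, "c": 0.25}
code = {"a": "0", "b": "10", "c": "11"}

# Expected codeword length L = sum_i p_i * l_i
L = sum(p[s] * len(code[s]) for s in p)
print(L)  # 1.5 bits per symbol
```

For this distribution the entropy is also 1.5 bits, so the Huffman code is not just optimal among prefix codes; it matches the entropy exactly because every probability is a power of 1/2.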
2) Tree construction and tie-breaking
The algorithm is simple; the determinism is not.
If two nodes have equal weight, you must define a tie-breaker (e.g., smallest symbol id). Otherwise, different runs can generate different but still optimal codes, which breaks tests and canonicalization.
We will use:
- Primary: lower frequency first
- Tie: lower minimum symbol id in the subtree
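The rules above can be encoded directly in the heap ordering. The sketch below (function name and tree representation are my own choices: leaves are symbol ids, internal nodes are `(left, right)` tuples) uses `(frequency, min symbol id in subtree)` as the heap key; since live subtrees hold disjoint symbol sets, their minimum ids are distinct and the ordering is fully deterministic:

```python
import heapq

def build_tree(freqs):
    """Build a Huffman tree with deterministic tie-breaking.

    freqs: dict mapping symbol id -> frequency.
    Returns nested tuples: leaves are symbol ids, internal
    nodes are (left, right) pairs.
    """
    # Heap entries: (frequency, min symbol id in subtree, node).
    heap = [(f, sym, sym) for sym, f in freqs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        # Lowest (freq, min id) becomes the left child.
        f1, m1, n1 = heapq.heappop(heap)
        f2, m2, n2 = heapq.heappop(heap)
        heapq.heappush(heap, (f1 + f2, min(m1, m2), (n1, n2)))
    return heap[0][2]
```

Because ties on frequency fall through to the minimum symbol id, two nodes never compare equal, and repeated runs on the same input always produce the same tree.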
3) From tree to code
Assign 0 to left edges, 1 to right edges. The path from root to a leaf gives the code.
If only one symbol exists, assign it the 1-bit code 0. This avoids a zero-length code.
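A sketch of this traversal, assuming the tree is represented as nested pairs with symbol ids at the leaves (the representation and function name are assumptions, not part of the lesson):

```python
def make_codes(tree):
    """Walk the tree: 0 on left edges, 1 on right edges.

    tree: a leaf (symbol id) or a (left, right) tuple.
    Returns a dict mapping symbol id -> bit string.
    """
    if not isinstance(tree, tuple):
        # Single-symbol alphabet: force the 1-bit code "0".
        return {tree: "0"}
    codes = {}
    def walk(node, prefix):
        if isinstance(node, tuple):
            walk(node[0], prefix + "0")
            walk(node[1], prefix + "1")
        else:
            codes[node] = prefix
    walk(tree, "")
    return codes
```

Handling the one-leaf case up front keeps the recursive walk simple: every other leaf sits at depth at least 1, so its prefix is never empty.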
4) Canonical Huffman
Canonical Huffman replaces tree shape with a deterministic code assignment based only on code lengths.
Steps:
- Sort symbols by (code length, symbol id).
- Assign codes in that order, incrementing the code as an integer after each symbol.
- Left-shift the running code each time the length increases.
Benefits:
- Smaller headers (only lengths are sent)
- Fast decode tables
- Deterministic output
5) Practical boundaries
- Code lengths are bounded by the number of symbols and total weight.
- For real formats, lengths are often capped (e.g., 15 in DEFLATE).
- We will not length-limit, but our constraints keep lengths small.
What you will build
- Huffman tree build + code table
- Canonical Huffman from code lengths
Key takeaways
- Huffman is optimal for a known distribution.
- Tie-breaking matters for reproducibility.
- Canonical form is the industry standard for compact headers.