Huffman Coding

Lesson, slides, and applied problem sets.

View Slides

Lesson

Huffman Coding

Why this module exists

Huffman coding is the canonical example of optimal prefix codes when symbol probabilities are known. It is also the point where theory meets real implementation details: tie-breaking, canonical forms, and deterministic decoding tables.


1) The Huffman idea

Given frequencies f_i, build a prefix-free code that minimizes expected length:

L = sum_i p_i * l_i

Huffman does this by repeatedly combining the two least probable symbols. This greedy step is optimal and produces a full binary tree.


2) Tree construction and tie-breaking

The algorithm is simple; the determinism is not.

If two nodes have equal weight, you must define a tie-breaker (e.g., smallest symbol id). Otherwise, different runs can generate different but still optimal codes, which breaks tests and canonicalization.

We will use:

  • Primary: lower frequency first
  • Tie: lower minimum symbol id in the subtree

3) From tree to code

Assign 0 to left edges, 1 to right edges. The path from root to a leaf gives the code.

If only one symbol exists, assign it a 1-bit code 0. This avoids zero-length codes.


4) Canonical Huffman

Canonical Huffman replaces tree shape with a deterministic code assignment based only on code lengths.

Steps:

  1. Sort symbols by (length, symbol id).
  2. Assign codes in order, incrementing as integers.
  3. Left shift when length increases.

Benefits:

  • Smaller headers (only lengths are sent)
  • Fast decode tables
  • Deterministic output

5) Practical boundaries

  • Code lengths are bounded by the number of symbols and total weight.
  • For real formats, lengths are often capped (e.g., 15 in DEFLATE).
  • We will not length-limit, but our constraints keep lengths small.

What you will build

  1. Huffman tree build + code table
  2. Canonical Huffman from code lengths

Key takeaways

  • Huffman is optimal for a known distribution.
  • Tie-breaking matters for reproducibility.
  • Canonical form is the industry standard for compact headers.

Module Items