Capstone: DEFLATE-Lite

Lesson, slides, and applied problem sets.

View Slides

Lesson

Capstone: DEFLATE-Lite

Why this module exists

Real compressors are pipelines. DEFLATE combines LZ77 tokenization with Huffman coding over literals, lengths, and distances. This capstone builds a simplified but fully end-to-end pipeline with explicit headers, canonical codes, and bit-level framing.


1) Pipeline overview

  1. LZ77 produces literals and matches.
  2. Huffman codes symbols based on frequency.
  3. Canonical codes reduce header size and make decoding deterministic.
  4. Bitstream framing ties everything together.

2) Token alphabets

We use two alphabets:

  • Literal/Length: bytes 0-255 plus length symbols 256-271, plus an end symbol.
  • Distance: 0-127 (distance = symbol + 1).

This is smaller than full DEFLATE but captures the same structure.


3) Fixed vs dynamic codes

We support two modes:

  • Fixed: predefined code lengths (literal/length = 9, distance = 7). Header is tiny.
  • Dynamic: code lengths are computed from frequencies and stored in the header.

The encoder can choose the smaller of the two. The decoder must accept both, indicated by a mode byte in the header.

4) Headers and determinism

Instead of complex bit-packed headers, we store code lengths explicitly in dynamic mode. This is larger but makes correctness the focus. You must build canonical codes from these lengths in the decoder.


5) What to watch out for

  • Bit order (MSB-first) must be consistent.
  • Match copying must allow overlap.
  • Token choices affect the Huffman statistics.
  • Decoder must stop on the end symbol.

What you will build

  • A full DEFLATE-like compressor + decompressor
  • Canonical Huffman tables and bitstream framing

Key takeaways

  • End-to-end correctness is a composition of many small invariants.
  • Canonical codes make binary formats practical.
  • The pipeline is as important as any single algorithm.

Module Items