Dictionary Coding

Lesson, slides, and applied problem sets.

View Slides

Lesson

Dictionary Coding: LZ77 & LZW

Why this module exists

Dictionary coders exploit repetition rather than symbol skew. They convert repeated substrings into compact references, then rely on a second-stage entropy coder (often Huffman) to finish the job. The key tradeoff is match choice vs code cost.


1) LZ77 in one paragraph

LZ77 scans the input and replaces repeated substrings with (distance, length) pointers into a sliding window of previous output. Tokens are either:

  • Literal: a raw byte
  • Match: (distance, length)

The decoder reconstructs by copying from its own output, which allows overlaps.


2) Greedy matching pitfalls

The longest match is not always the best once you consider how many bits it costs to encode length and distance. Real formats optimize across this tradeoff. We will use greedy longest match to keep the implementation focused.


3) Window size and lookahead

  • Larger window finds longer matches but costs more search time.
  • Lookahead caps the maximum match length.
  • Both affect compression ratio and complexity.

4) LZW: implicit dictionary

LZW builds a dictionary of phrases on the fly:

  • Start with all single bytes in the dictionary.
  • Emit codes for the longest dictionary phrase seen.
  • Add new phrases as concatenations of known ones.

The decoder rebuilds the same dictionary deterministically, including the tricky KwKwK case where a code refers to the entry being defined.


5) Why dictionary coding matters

  • Captures repetition beyond symbol frequencies.
  • Works well on structured data (text, logs, source code).
  • Pairs naturally with entropy coding.

What you will build

  1. LZ77 compressor + decompressor (greedy, bounded window)
  2. LZW compressor + decompressor (fixed max dictionary size)

Key takeaways

  • Dictionary coders exploit repetition, not just skew.
  • Greedy matches are a practical compromise.
  • Encoder and decoder must agree on dictionary rules.

Module Items