Dictionary Coding
Lesson, slides, and applied problem sets.
Dictionary Coding: LZ77 & LZW
Why this module exists
Dictionary coders exploit repetition rather than symbol skew: they convert repeated substrings into compact references, then rely on a second-stage entropy coder (often Huffman) to finish the job. The central tradeoff is match choice versus the bit cost of encoding each match.
1) LZ77 in one paragraph
LZ77 scans the input and replaces repeated substrings with (distance, length) pointers into a sliding window of previous output. Tokens are either:
- Literal: a raw byte
- Match: (distance, length)
The decoder reconstructs by copying from its own output, which allows overlapping copies: a match whose length exceeds its distance repeats bytes it has just written.
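Below is a minimal decoder sketch in Python; the `("lit", byte)` / `("match", distance, length)` tuple representation is an illustrative choice, not a prescribed format. Copying byte by byte is what makes overlapping matches work.

```python
def lz77_decompress(tokens):
    """Rebuild the original bytes from literal and (distance, length) tokens.

    Bytes are appended one at a time, so a match may overlap the region it
    is still writing (e.g. distance=1, length=5 repeats a single byte 5x).
    """
    out = bytearray()
    for token in tokens:
        if token[0] == "lit":
            out.append(token[1])
        else:  # ("match", distance, length): copy from our own output
            _, distance, length = token
            start = len(out) - distance
            for i in range(length):
                out.append(out[start + i])
    return bytes(out)


# Overlapping copy: one literal 'a', then copy 5 bytes from distance 1.
assert lz77_decompress([("lit", ord("a")), ("match", 1, 5)]) == b"aaaaaa"
```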
2) Greedy matching pitfalls
The longest match is not always the best once you account for how many bits it costs to encode the length and distance. Real formats optimize across this tradeoff (DEFLATE's lazy matching, for example, may skip a match when the next position offers a better one). We will use greedy longest match to keep the implementation focused.
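Here is a sketch of that greedy choice, using the same token format as the decoder above; the `min_len` threshold and the brute-force search are simplifying assumptions, since real encoders use hash chains or similar structures to find candidates quickly.

```python
def lz77_compress(data, window=4096, max_len=255, min_len=3):
    """Greedy LZ77 tokenizer: always take the longest match in the window.

    Brute-force search for clarity. `window` bounds how far back we look,
    `max_len` caps the match length, and matches shorter than `min_len`
    become literals because short matches rarely pay for their code cost.
    """
    tokens = []
    i = 0
    while i < len(data):
        best_len, best_dist = 0, 0
        for j in range(max(0, i - window), i):  # candidate match positions
            length = 0
            while (length < max_len and i + length < len(data)
                   and data[j + length] == data[i + length]):
                length += 1
            if length > best_len:
                best_len, best_dist = length, i - j
        if best_len >= min_len:
            tokens.append(("match", best_dist, best_len))
            i += best_len
        else:
            tokens.append(("lit", data[i]))
            i += 1
    return tokens
```

Paired with the decoder sketch above, `lz77_decompress(lz77_compress(data)) == data` should hold for any bytes input; `window` and `max_len` are exactly the knobs the next section discusses.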
3) Window size and lookahead
- A larger window can reach matches farther back, but costs more search time and larger distance codes.
- Lookahead caps the maximum match length.
- Both affect compression ratio and complexity.
4) LZW: implicit dictionary
LZW builds a dictionary of phrases on the fly:
- Start with all 256 single bytes in the dictionary.
- Repeatedly emit the code of the longest dictionary phrase that matches the upcoming input.
- After each emission, add a new phrase: the matched phrase plus the next input byte.
The decoder rebuilds the same dictionary deterministically, including the tricky KwKwK case, where a code arrives before its entry has been added; the missing entry is the previous phrase plus its own first byte.
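A compact sketch of both directions, assuming a fixed maximum dictionary size (the `MAX_CODES` constant and function names here are illustrative). The decoder's `elif` branch is the KwKwK case: the arriving code equals the next code to be assigned, so the entry must be the previous phrase plus its own first byte.

```python
MAX_CODES = 4096  # fixed dictionary size; stop adding entries once full


def lzw_compress(data: bytes) -> list[int]:
    """Emit the code of the longest known phrase, then extend the dictionary."""
    dictionary = {bytes([b]): b for b in range(256)}
    codes, phrase = [], b""
    for byte in data:
        candidate = phrase + bytes([byte])
        if candidate in dictionary:
            phrase = candidate
        else:
            codes.append(dictionary[phrase])
            if len(dictionary) < MAX_CODES:
                dictionary[candidate] = len(dictionary)
            phrase = bytes([byte])
    if phrase:
        codes.append(dictionary[phrase])
    return codes


def lzw_decompress(codes: list[int]) -> bytes:
    """Rebuild the dictionary in lockstep with the encoder."""
    dictionary = {b: bytes([b]) for b in range(256)}
    out, prev = bytearray(), None
    for code in codes:
        if code in dictionary:
            entry = dictionary[code]
        elif prev is not None and code == len(dictionary):
            entry = prev + prev[:1]  # KwKwK: the code being defined right now
        else:
            raise ValueError("corrupt LZW stream")
        out += entry
        if prev is not None and len(dictionary) < MAX_CODES:
            dictionary[len(dictionary)] = prev + entry[:1]
        prev = entry
    return bytes(out)


# b"abababab" exercises the KwKwK branch (code 258 arrives before it is added).
assert lzw_decompress(lzw_compress(b"abababab")) == b"abababab"
```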
5) Why dictionary coding matters
- Captures repetition beyond symbol frequencies.
- Works well on structured data (text, logs, source code).
- Pairs naturally with entropy coding.
What you will build
- LZ77 compressor + decompressor (greedy, bounded window)
- LZW compressor + decompressor (fixed max dictionary size)
Key takeaways
- Dictionary coders exploit repetition, not just skew.
- Greedy matches are a practical compromise.
- Encoder and decoder must agree on dictionary rules.