Probabilistic Data Structures
Lesson, slides, and applied problem sets.
View SlidesLesson
Probabilistic Data Structures
Why this module exists
Sometimes you need to check "have I seen this before?" for millions of items, but you can't afford the memory for a full hash set. Probabilistic data structures trade perfect accuracy for dramatic memory savings.
A Bloom filter can tell you if an element is definitely not in a set or probably in a set — using a fraction of the memory. This is useful in databases, caches, network routers, and spam filters.
1) The Problem with Hash Sets
A hash set gives O(1) lookups, but stores every element:
Elements: "apple", "banana", "cherry", ... (1 million strings)
Memory: ~50 MB (for average 50-byte strings)
What if you just need to answer "might this be in the set?" and can tolerate occasional false positives?
2) Bloom Filter: The Idea
A Bloom filter is a bit array of size m with k hash functions.
To add an element:
- Hash the element with each of the
khash functions - Each hash gives a position (0 to m-1)
- Set those
kbits to 1
To check membership:
- Hash the element with the same
kfunctions - Check if ALL
kpositions are 1 - If any bit is 0 → definitely not in set
- If all bits are 1 → probably in set (might be false positive)
Bit array: [0, 0, 0, 0, 0, 0, 0, 0, 0, 0] (m=10)
Add "apple" with k=3 hashes → positions 2, 5, 7
[0, 0, 1, 0, 0, 1, 0, 1, 0, 0]
Add "banana" with k=3 hashes → positions 1, 5, 9
[0, 1, 1, 0, 0, 1, 0, 1, 0, 1]
Check "cherry" → positions 2, 4, 7
Position 4 is 0 → definitely NOT in set
Check "grape" → positions 1, 5, 7
All are 1 → probably in set (false positive!)
3) Why False Positives Happen
As you add more elements, more bits become 1. Eventually, a query might find all its k positions already set by other elements — a false positive.
False positives: Bloom filter says "yes" but element was never added. False negatives: Bloom filter says "no" but element was added. This never happens!
If any bit is 0, that element was definitely never added (or someone corrupted the filter).
4) Choosing Parameters
The false positive rate depends on:
m: number of bits (more bits = lower FP rate)k: number of hash functions (there's an optimal k)n: number of elements inserted
Formulas:
Optimal k:
k = (m/n) × ln(2) ≈ 0.693 × (m/n)
False positive rate (approximately):
FP ≈ (1 - e^(-kn/m))^k
Rule of thumb:
- 10 bits per element with 7 hash functions → ~1% false positive rate
- 5 bits per element with 3-4 hashes → ~10% false positive rate
5) Implementation in Go
type BloomFilter struct {
bits []bool
k int
size int
}
func NewBloomFilter(size, k int) *BloomFilter {
return &BloomFilter{
bits: make([]bool, size),
k: k,
size: size,
}
}
// Generate k hash positions using double hashing
func (bf *BloomFilter) hashPositions(value string) []int {
h1 := fnv1a(value)
h2 := djb2(value)
positions := make([]int, bf.k)
for i := 0; i < bf.k; i++ {
// Double hashing: h(i) = h1 + i*h2
positions[i] = int((h1 + uint64(i)*h2) % uint64(bf.size))
}
return positions
}
func (bf *BloomFilter) Add(value string) {
for _, pos := range bf.hashPositions(value) {
bf.bits[pos] = true
}
}
func (bf *BloomFilter) Contains(value string) bool {
for _, pos := range bf.hashPositions(value) {
if !bf.bits[pos] {
return false // Definitely not present
}
}
return true // Probably present
}
6) Double Hashing Trick
You don't need k independent hash functions. Use double hashing:
h(i) = h1(x) + i × h2(x) for i = 0, 1, ..., k-1
Where h1 and h2 are two independent hash functions (e.g., FNV-1a and DJB2, or different seeds of the same function).
This is mathematically proven to work as well as k independent functions.
7) Common Hash Functions
FNV-1a (fast, good distribution):
func fnv1a(s string) uint64 {
var h uint64 = 14695981039346656037
for i := 0; i < len(s); i++ {
h ^= uint64(s[i])
h *= 1099511628211
}
return h
}
DJB2 (simple, fast):
func djb2(s string) uint64 {
var h uint64 = 5381
for i := 0; i < len(s); i++ {
h = ((h << 5) + h) + uint64(s[i])
}
return h
}
8) When to Use Bloom Filters
Good use cases:
- Cache: "Is this URL likely cached?" (avoid disk lookup if no)
- Databases: "Might this key exist?" (avoid expensive scan if no)
- Network: "Have I seen this packet before?"
- Spell checkers: "Is this a known word?"
Bad use cases:
- Need to delete elements (standard Bloom filters don't support deletion)
- Need exact membership (no false positives allowed)
- Small sets (hash set is simpler and exact)
9) Variants
Counting Bloom Filter: Replace bits with counters to support deletion. Uses more memory.
Cuckoo Filter: Supports deletion, can be more space-efficient for some FP rates.
HyperLogLog: Estimates cardinality (count of distinct elements) using very little memory.
10) Complexity Analysis
| Operation | Time | Space |
|---|---|---|
| Add | O(k) | - |
| Contains | O(k) | - |
| Filter | - | O(m) bits |
Where k is typically 3-10 and m might be millions of bits (but still much smaller than storing elements).
Outcomes
After this module, you should be able to:
- Explain why Bloom filters have false positives but not false negatives
- Implement a Bloom filter with
khash functions - Use double hashing to generate
kpositions from 2 hash functions - Choose appropriate
mandkfor a target false positive rate - Recognize when Bloom filters are the right tool
Practice Set
- Bloom Filter (Medium) — Implement Add and Contains with configurable size and k
This single problem covers the core concept. Extensions to explore: counting Bloom filters, analyzing false positive rates experimentally.