Probabilistic Data Structures

Lesson, slides, and applied problem sets.

View Slides

Lesson

Probabilistic Data Structures

Why this module exists

Sometimes you need to check "have I seen this before?" for millions of items, but you can't afford the memory for a full hash set. Probabilistic data structures trade perfect accuracy for dramatic memory savings.

A Bloom filter can tell you if an element is definitely not in a set or probably in a set — using a fraction of the memory. This is useful in databases, caches, network routers, and spam filters.


1) The Problem with Hash Sets

A hash set gives O(1) lookups, but stores every element:

Elements: "apple", "banana", "cherry", ... (1 million strings)
Memory: ~50 MB (for average 50-byte strings)

What if you just need to answer "might this be in the set?" and can tolerate occasional false positives?


2) Bloom Filter: The Idea

A Bloom filter is a bit array of size m with k hash functions.

To add an element:

  1. Hash the element with each of the k hash functions
  2. Each hash gives a position (0 to m-1)
  3. Set those k bits to 1

To check membership:

  1. Hash the element with the same k functions
  2. Check if ALL k positions are 1
  3. If any bit is 0 → definitely not in set
  4. If all bits are 1 → probably in set (might be false positive)
Bit array:  [0, 0, 0, 0, 0, 0, 0, 0, 0, 0]  (m=10)

Add "apple" with k=3 hashes → positions 2, 5, 7
            [0, 0, 1, 0, 0, 1, 0, 1, 0, 0]

Add "banana" with k=3 hashes → positions 1, 5, 9
            [0, 1, 1, 0, 0, 1, 0, 1, 0, 1]

Check "cherry" → positions 2, 4, 7
            Position 4 is 0 → definitely NOT in set

Check "grape" → positions 1, 5, 7
            All are 1 → probably in set (false positive!)

3) Why False Positives Happen

As you add more elements, more bits become 1. Eventually, a query might find all its k positions already set by other elements — a false positive.

False positives: Bloom filter says "yes" but element was never added. False negatives: Bloom filter says "no" but element was added. This never happens!

If any bit is 0, that element was definitely never added (or someone corrupted the filter).


4) Choosing Parameters

The false positive rate depends on:

  • m: number of bits (more bits = lower FP rate)
  • k: number of hash functions (there's an optimal k)
  • n: number of elements inserted

Formulas:

Optimal k:

k = (m/n) × ln(2) ≈ 0.693 × (m/n)

False positive rate (approximately):

FP ≈ (1 - e^(-kn/m))^k

Rule of thumb:

  • 10 bits per element with 7 hash functions → ~1% false positive rate
  • 5 bits per element with 3-4 hashes → ~10% false positive rate

5) Implementation in Go

type BloomFilter struct {
    bits []bool
    k    int
    size int
}

func NewBloomFilter(size, k int) *BloomFilter {
    return &BloomFilter{
        bits: make([]bool, size),
        k:    k,
        size: size,
    }
}

// Generate k hash positions using double hashing
func (bf *BloomFilter) hashPositions(value string) []int {
    h1 := fnv1a(value)
    h2 := djb2(value)

    positions := make([]int, bf.k)
    for i := 0; i < bf.k; i++ {
        // Double hashing: h(i) = h1 + i*h2
        positions[i] = int((h1 + uint64(i)*h2) % uint64(bf.size))
    }
    return positions
}

func (bf *BloomFilter) Add(value string) {
    for _, pos := range bf.hashPositions(value) {
        bf.bits[pos] = true
    }
}

func (bf *BloomFilter) Contains(value string) bool {
    for _, pos := range bf.hashPositions(value) {
        if !bf.bits[pos] {
            return false  // Definitely not present
        }
    }
    return true  // Probably present
}

6) Double Hashing Trick

You don't need k independent hash functions. Use double hashing:

h(i) = h1(x) + i × h2(x)   for i = 0, 1, ..., k-1

Where h1 and h2 are two independent hash functions (e.g., FNV-1a and DJB2, or different seeds of the same function).

This is mathematically proven to work as well as k independent functions.


7) Common Hash Functions

FNV-1a (fast, good distribution):

func fnv1a(s string) uint64 {
    var h uint64 = 14695981039346656037
    for i := 0; i < len(s); i++ {
        h ^= uint64(s[i])
        h *= 1099511628211
    }
    return h
}

DJB2 (simple, fast):

func djb2(s string) uint64 {
    var h uint64 = 5381
    for i := 0; i < len(s); i++ {
        h = ((h << 5) + h) + uint64(s[i])
    }
    return h
}

8) When to Use Bloom Filters

Good use cases:

  • Cache: "Is this URL likely cached?" (avoid disk lookup if no)
  • Databases: "Might this key exist?" (avoid expensive scan if no)
  • Network: "Have I seen this packet before?"
  • Spell checkers: "Is this a known word?"

Bad use cases:

  • Need to delete elements (standard Bloom filters don't support deletion)
  • Need exact membership (no false positives allowed)
  • Small sets (hash set is simpler and exact)

9) Variants

Counting Bloom Filter: Replace bits with counters to support deletion. Uses more memory.

Cuckoo Filter: Supports deletion, can be more space-efficient for some FP rates.

HyperLogLog: Estimates cardinality (count of distinct elements) using very little memory.


10) Complexity Analysis

OperationTimeSpace
AddO(k)-
ContainsO(k)-
Filter-O(m) bits

Where k is typically 3-10 and m might be millions of bits (but still much smaller than storing elements).


Outcomes

After this module, you should be able to:

  • Explain why Bloom filters have false positives but not false negatives
  • Implement a Bloom filter with k hash functions
  • Use double hashing to generate k positions from 2 hash functions
  • Choose appropriate m and k for a target false positive rate
  • Recognize when Bloom filters are the right tool

Practice Set

  1. Bloom Filter (Medium) — Implement Add and Contains with configurable size and k

This single problem covers the core concept. Extensions to explore: counting Bloom filters, analyzing false positive rates experimentally.


Module Items