Membership & Failure Detection
Lesson, slides, and applied problem sets.
Why membership matters
Distributed systems fail constantly. You need to know who is in the cluster and who is healthy in order to route traffic safely.
Gossip membership
Nodes exchange compact summaries (digests) of what they know. On each round:
- Compare digests
- Send updates for nodes you are ahead on
- Request updates for nodes you are behind on
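The round above can be sketched as a single comparison function. This is a minimal illustration, not the lesson's reference solution: it assumes a digest is a mapping from node id to the highest version number known for that node, and the names `Digest` and `compare_digests` are hypothetical.

```python
from typing import Dict, List, Tuple

# Assumption: a digest maps node id -> highest version known for that node.
Digest = Dict[str, int]


def compare_digests(local: Digest, remote: Digest) -> Tuple[List[str], List[str]]:
    """Return (to_send, to_request) from the local node's point of view."""
    to_send: List[str] = []
    to_request: List[str] = []
    # Walk the union of node ids so nodes known to only one side are covered.
    for node in sorted(set(local) | set(remote)):
        mine = local.get(node, -1)     # -1 means "never heard of this node"
        theirs = remote.get(node, -1)
        if mine > theirs:
            to_send.append(node)       # we are ahead: push our update
        elif mine < theirs:
            to_request.append(node)    # we are behind: ask for theirs
    return to_send, to_request


local = {"a": 5, "b": 2, "c": 7}
remote = {"a": 3, "b": 4, "d": 1}
to_send, to_request = compare_digests(local, remote)
print(to_send)     # ['a', 'c']
print(to_request)  # ['b', 'd']
```

Nodes with equal versions appear in neither list, so two peers that already agree exchange nothing beyond the digests themselves.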
Failure detection
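Failure detectors are probabilistic: a silent node may have crashed, or it may just be slow or behind a delayed message, so the detector can only estimate. We track the intervals between heartbeats and compute a suspicion score. A common approach is phi accrual, which turns the time since the last heartbeat into a numeric "suspicion" value that grows the longer a node stays silent.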
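One way to picture the suspicion score is the sketch below. It makes a simplifying assumption that heartbeat intervals are exponentially distributed, so the probability of a silence of at least `elapsed` seconds is `exp(-elapsed / mean)` and phi is the negative base-10 logarithm of that probability. Other formulations model the intervals with a normal distribution instead. The class name, the window size, and the threshold of 8 are illustrative choices, not values given by the lesson.

```python
import math
from collections import deque
from typing import Deque, Optional


class PhiAccrualDetector:
    """Sketch of a phi accrual check under an exponential-interval assumption."""

    def __init__(self, window_size: int = 100, threshold: float = 8.0) -> None:
        # Sliding window of recent inter-heartbeat intervals, in seconds.
        self.intervals: Deque[float] = deque(maxlen=window_size)
        self.last_heartbeat: Optional[float] = None
        self.threshold = threshold  # illustrative default

    def heartbeat(self, now: float) -> None:
        """Record a heartbeat that arrived at time `now`."""
        if self.last_heartbeat is not None:
            self.intervals.append(now - self.last_heartbeat)
        self.last_heartbeat = now

    def phi(self, now: float) -> float:
        """Suspicion score: -log10 of P(silence lasts at least this long)."""
        if not self.intervals or self.last_heartbeat is None:
            return 0.0  # not enough history to judge
        mean = sum(self.intervals) / len(self.intervals)
        if mean <= 0:
            return 0.0
        elapsed = now - self.last_heartbeat
        # -log10(exp(-elapsed / mean)) simplifies to elapsed / (mean * ln 10).
        return elapsed / (mean * math.log(10))

    def is_suspect(self, now: float) -> bool:
        return self.phi(now) >= self.threshold


detector = PhiAccrualDetector()
for t in (0.0, 1.0, 2.0, 3.0):   # heartbeats one second apart
    detector.heartbeat(t)

print(detector.is_suspect(4.0))   # False: one interval late is unremarkable
print(detector.is_suspect(25.0))  # True: 22 s of silence against a 1 s mean
```

Because phi is continuous, the threshold becomes a tuning knob: a higher value tolerates longer pauses at the cost of slower detection.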
What you will build
- Digest merge logic (send vs request)
- Phi accrual suspicion check
- SWIM-style suspicion updates
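For the last item, the sketch below shows one way SWIM-style suspicion updates are commonly resolved: each member carries a status and an incarnation number, a higher incarnation wins, a suspicion beats an alive claim at the same incarnation, and a confirmed death is final. The `MemberState` type and `apply_update` function are hypothetical names, and the problem set may define its own rules and data shapes.

```python
from dataclasses import dataclass

ALIVE, SUSPECT, DEAD = "alive", "suspect", "dead"


@dataclass(frozen=True)
class MemberState:
    status: str        # one of ALIVE, SUSPECT, DEAD
    incarnation: int   # bumped by the member itself to refute suspicion


def apply_update(current: MemberState, update: MemberState) -> MemberState:
    """Return the state to keep after receiving `update` about a member."""
    if current.status == DEAD:
        return current                 # a confirmed death is never undone
    if update.status == DEAD:
        return update                  # confirmation overrides alive/suspect
    if update.incarnation > current.incarnation:
        return update                  # newer incarnation always wins
    if (update.incarnation == current.incarnation
            and update.status == SUSPECT
            and current.status == ALIVE):
        return update                  # same incarnation: suspect beats alive
    return current                     # stale or redundant update


state = MemberState(ALIVE, 1)
state = apply_update(state, MemberState(SUSPECT, 1))
print(state.status)   # suspect
state = apply_update(state, MemberState(ALIVE, 1))
print(state.status)   # suspect (an alive claim at the same incarnation is too old)
state = apply_update(state, MemberState(ALIVE, 2))
print(state.status)   # alive (the member refuted with a higher incarnation)
```

The incarnation number is what lets a live node clear a false suspicion: only the node itself increments it, so a higher value is proof the node was running after the suspicion was raised.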