Membership & Failure Detection

Lesson, slides, and applied problem sets.

Why membership matters

Nodes in a distributed system crash, restart, and become unreachable all the time. To route traffic safely, each node needs a current view of who is in the cluster and which members are healthy.


Gossip membership

Nodes exchange compact summaries (digests) of what they know. On each round:

  • Compare digests
  • Send updates for nodes you are ahead on
  • Request updates for nodes you are behind on
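
The round above can be sketched as follows. This is a minimal illustration, not a prescribed implementation: it assumes a digest is a dict mapping a node ID to a version number (for example a heartbeat counter), and the name compare_digests is hypothetical.

```python
def compare_digests(local, remote):
    """Split node IDs into those to send and those to request.

    Assumption: a digest maps node_id -> version, where a higher
    version means newer information about that node.
    """
    to_send = []      # nodes where we are ahead of the peer
    to_request = []   # nodes where we are behind the peer
    for node in sorted(set(local) | set(remote)):
        # A node missing from a digest counts as older than any real version.
        local_version = local.get(node, -1)
        remote_version = remote.get(node, -1)
        if local_version > remote_version:
            to_send.append(node)
        elif remote_version > local_version:
            to_request.append(node)
    return to_send, to_request


local = {"a": 3, "b": 1, "c": 2}
remote = {"a": 1, "b": 4, "d": 1}
to_send, to_request = compare_digests(local, remote)
```

Here the local node is ahead on "a" and "c" (the peer has never heard of "c") and behind on "b" and "d". Nodes with equal versions appear in neither list, which keeps each round's payload small.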

Failure detection

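Failure detectors are probabilistic: a missed heartbeat may mean a crashed node or only a slow network, so a detector expresses doubt rather than certainty. We track the intervals between heartbeats and compute a suspicion score. A common approach is phi accrual, which turns the time since the last heartbeat into a numeric "suspicion" value that grows the longer a node stays silent.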
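
One way to compute such a score is sketched below. It is a simplification under a stated assumption: heartbeat intervals are treated as exponentially distributed, so phi = -log10(P(interval > elapsed)) reduces to elapsed / (mean * ln 10). Other formulations fit a normal distribution instead. The class name, window size, and threshold of 8 are illustrative choices, not values from this lesson.

```python
import math
from collections import deque


class PhiAccrualDetector:
    """Phi accrual sketch, assuming exponentially distributed intervals."""

    def __init__(self, window_size=100):
        # Keep only the most recent intervals so the mean tracks current conditions.
        self.intervals = deque(maxlen=window_size)
        self.last_heartbeat = None

    def heartbeat(self, now):
        """Record a heartbeat that arrived at time `now` (seconds)."""
        if self.last_heartbeat is not None:
            self.intervals.append(now - self.last_heartbeat)
        self.last_heartbeat = now

    def phi(self, now):
        """Return the suspicion score at time `now`."""
        if not self.intervals:
            return 0.0  # not enough history to suspect anything
        mean = sum(self.intervals) / len(self.intervals)
        if mean <= 0:
            return 0.0  # degenerate history; avoid dividing by zero
        elapsed = now - self.last_heartbeat
        # -log10(exp(-elapsed / mean)) == elapsed / (mean * ln 10)
        return elapsed / (mean * math.log(10))

    def is_suspect(self, now, threshold=8.0):
        return self.phi(now) >= threshold


detector = PhiAccrualDetector()
for t in (0.0, 1.0, 2.0, 3.0):
    detector.heartbeat(t)
```

With heartbeats one second apart, phi is 0 immediately after a heartbeat and reaches 1 after ln(10) seconds of silence (about 2.3 s). Raising the threshold trades slower detection for fewer false suspicions.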


What you will build

  • Digest merge logic (send vs request)
  • Phi accrual suspicion check
  • SWIM-style suspicion updates
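
As a preview of the last item, here is a hedged sketch of how SWIM-style updates about one member might be merged. It assumes each update carries a status and an incarnation number, and that a dead verdict is final in this sketch (rejoining is out of scope). The names MemberState and apply_update are hypothetical.

```python
from dataclasses import dataclass

ALIVE, SUSPECT, DEAD = "alive", "suspect", "dead"


@dataclass(frozen=True)
class MemberState:
    status: str
    incarnation: int


def apply_update(current, update):
    """Return the state to keep after receiving `update` about a member."""
    if current.status == DEAD:
        return current  # assumption: dead is terminal in this sketch
    if update.status == DEAD:
        return update   # a confirmed failure overrides alive and suspect
    if update.status == SUSPECT:
        if update.incarnation > current.incarnation:
            return update
        # At the same incarnation, suspicion beats alive but not suspicion.
        if update.incarnation == current.incarnation and current.status == ALIVE:
            return update
        return current
    # An alive update only wins with a strictly newer incarnation, which is
    # how a suspected node refutes the suspicion about itself.
    if update.incarnation > current.incarnation:
        return update
    return current


state = MemberState(ALIVE, 1)
state = apply_update(state, MemberState(SUSPECT, 1))  # becomes suspect
state = apply_update(state, MemberState(ALIVE, 1))    # stale alive is ignored
state = apply_update(state, MemberState(ALIVE, 2))    # refuted by new incarnation
```

The incarnation number is bumped only by the member itself, so a stale "alive" rumor cannot clear a newer suspicion.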
