Observability & Resilience

Lesson, slides, and applied problem sets.

Lesson

Why it matters

Real systems fail and slow down. You need metrics to detect when that happens and controls to keep the system alive while it does.


SLOs and tail latency

  • SLOs define what "good" looks like
  • Tail latency (p99/p999) drives user experience
  • Error budgets define how much unreliability the SLO allows; the burn rate tells you how fast you're spending it
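
To make the tail-latency and error-budget ideas concrete, here is a minimal Python sketch. It assumes a nearest-rank percentile definition and defines burn rate as the observed error ratio divided by the error ratio the SLO allows; the function names and the sample data are hypothetical, not taken from the course materials.

```python
import math

def percentile(samples, p):
    """Nearest-rank percentile: smallest sample with at least p% of samples <= it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]

def burn_rate(bad, total, slo_target):
    """Observed error ratio divided by the error ratio the SLO allows."""
    if total == 0:
        return 0.0
    allowed = 1.0 - slo_target  # the error budget, as a fraction of requests
    return (bad / total) / allowed

# Hypothetical data: 100 requests, two of them slow.
latencies_ms = [12] * 98 + [250, 900]
p50 = percentile(latencies_ms, 50)   # the median hides the slow requests
p99 = percentile(latencies_ms, 99)   # the tail exposes them

# 5 failures in 1000 requests against a 99.9% SLO: roughly 5x the allowed rate.
rate = burn_rate(bad=5, total=1000, slo_target=0.999)
```

A burn rate near 1 means the budget lasts about as long as the SLO window; a rate near 5 means it would be gone in about a fifth of that window if the error ratio held.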

Resilience controls

  • Retries can amplify load on a dependency that is already struggling
  • Circuit breakers stop cascading failures by failing fast
  • Backpressure & load shedding protect the core
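
As an illustration of how a circuit breaker stops a cascade, the following Python sketch rejects calls immediately once a dependency has failed repeatedly, then lets a trial call through after a cool-down. The class name, the thresholds, and the injectable `clock` parameter are assumptions made for this example; it is single-threaded and not a production implementation.

```python
import time

class CircuitBreaker:
    """Closed -> open after `max_failures` consecutive failures;
    open -> half-open once `reset_timeout` seconds have passed."""

    def __init__(self, max_failures=3, reset_timeout=30.0, clock=time.monotonic):
        self.max_failures = max_failures
        self.reset_timeout = reset_timeout
        self.clock = clock          # injectable so simulations can control time
        self.failures = 0
        self.opened_at = None       # None means the circuit is closed

    def state(self):
        if self.opened_at is None:
            return "closed"
        if self.clock() - self.opened_at >= self.reset_timeout:
            return "half-open"
        return "open"

    def call(self, fn, *args, **kwargs):
        if self.state() == "open":
            # Fail fast: the dependency is never contacted.
            raise RuntimeError("circuit open")
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            # A failed half-open trial, or too many failures, (re)opens the circuit.
            if self.opened_at is not None or self.failures >= self.max_failures:
                self.opened_at = self.clock()
            raise
        # Any success closes the circuit and resets the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

Wrapping a call as `breaker.call(fetch, url)` (with `fetch` standing in for any function that may raise) means callers see a fast `RuntimeError` instead of piling more requests onto a failing dependency.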

What you will build

  • P99 + burn-rate metrics
  • Circuit breaker simulation
  • Backpressure queue simulation
  • Token bucket rate limiter
  • Retry backoff scheduling
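
For a sense of scale, three of the items above can be sketched in a few dozen lines of Python. Everything here is an assumption for illustration: the class and function names, the default parameters, and the choice of exponential backoff with full jitter, which is one common scheduling scheme and not necessarily the one the problem sets use.

```python
import random
import time
from collections import deque

class TokenBucket:
    """Allows bursts up to `capacity` and a sustained `rate` tokens per second."""

    def __init__(self, rate, capacity, clock=time.monotonic):
        self.rate = rate
        self.capacity = capacity
        self.clock = clock
        self.tokens = float(capacity)   # start full
        self.last = clock()

    def allow(self, cost=1.0):
        now = self.clock()
        # Refill in proportion to elapsed time, never above capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False

def backoff_delays(attempts, base=0.1, cap=5.0, rng=random.random):
    """Full jitter: delay i is rng() * min(cap, base * 2**i)."""
    return [rng() * min(cap, base * 2 ** i) for i in range(attempts)]

class BoundedQueue:
    """Backpressure by rejection: offers beyond `limit` are shed, not buffered."""

    def __init__(self, limit):
        self.limit = limit
        self.items = deque()
        self.shed = 0                   # count of rejected items

    def offer(self, item):
        if len(self.items) >= self.limit:
            self.shed += 1
            return False
        self.items.append(item)
        return True

    def poll(self):
        return self.items.popleft() if self.items else None
```

The jitter in `backoff_delays` spreads retries out so that many clients failing at the same moment do not all retry at the same moment, which is the load amplification the resilience section warns about.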
