Observability & Resilience

Lesson, slides, and applied problem sets.

Lesson

Why it matters

Real systems fail and slow down. You need metrics to detect failures and slowdowns, and controls to keep the system available while they are happening.


SLOs and tail latency

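  • SLOs define what "good" looks like as a measurable target (for example, 99.9% of requests succeed)
  • Tail latency (p99/p99.9) drives user experience, because the slowest requests are the ones users notice
  • Error budgets quantify how much unreliability the SLO allows; the burn rate tells you how fast you are spending it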
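
To make the two metrics above concrete, here is a minimal sketch in Python. It assumes the nearest-rank definition of a percentile and defines burn rate as the observed error rate divided by the error rate the SLO allows; the function names and sample data are illustrative, not taken from the problem sets.

```python
import math

def percentile(samples, pct):
    """Nearest-rank percentile: smallest sample with at least pct% of values at or below it."""
    if not samples:
        raise ValueError("samples must be non-empty")
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)  # 1-based rank
    return ordered[max(rank, 1) - 1]

def burn_rate(errors, total, slo_target):
    """Observed error rate divided by the error rate allowed by the SLO.

    A value of 1.0 means the budget is being spent exactly as fast as the
    SLO permits; values above 1.0 mean it will run out early.
    """
    if total == 0:
        return 0.0
    allowed_error_rate = 1.0 - slo_target
    return (errors / total) / allowed_error_rate

# Hypothetical latency samples in milliseconds.
latencies_ms = [12, 15, 11, 14, 13, 250, 16, 12, 13, 900]
print(percentile(latencies_ms, 50))  # 13
print(percentile(latencies_ms, 99))  # 900
print(round(burn_rate(errors=5, total=1000, slo_target=0.99), 3))  # 0.5
```

Note how the median (13 ms) hides the two slow requests that the p99 exposes. Other percentile definitions (for example, linear interpolation) give slightly different values on small samples.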

Resilience controls

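  • Retries can amplify load on a service that is already struggling
  • Circuit breakers stop cascades by failing fast instead of calling an unhealthy dependency
  • Backpressure & load shedding protect the core by bounding how much work is accepted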
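
One common way to limit retry amplification is exponential backoff, optionally with jitter so that clients do not retry in lockstep. The sketch below is an assumption about how such a schedule could be computed; the parameter names (base, factor, cap) and the "full jitter" variant are illustrative choices, not the course's reference solution.

```python
import random

def backoff_schedule(max_retries, base=0.1, factor=2.0, cap=5.0, jitter=False, rng=None):
    """Return the delay (in seconds) to wait before each retry attempt."""
    rng = rng or random.Random()
    delays = []
    for attempt in range(max_retries):
        # Grow the delay geometrically, but never beyond the cap.
        delay = min(cap, base * factor ** attempt)
        if jitter:
            # "Full jitter": pick uniformly between 0 and the computed delay.
            delay = rng.uniform(0.0, delay)
        delays.append(delay)
    return delays

print(backoff_schedule(5, base=0.1, factor=2.0, cap=1.0))
# [0.1, 0.2, 0.4, 0.8, 1.0]
```

Without the cap, delays keep doubling; without jitter, many clients that failed at the same moment retry at the same moment too.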

What you will build

  • P99 + burn-rate metrics
  • Circuit breaker simulation
  • Backpressure queue simulation
  • Token bucket rate limiter
  • Retry backoff scheduling
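
As a preview of the token bucket item in the list above, here is a minimal sketch. It assumes time is passed in explicitly (which keeps a simulation deterministic) and that the bucket starts full; the class and parameter names are hypothetical.

```python
class TokenBucket:
    """Allow a request only if enough tokens have accumulated."""

    def __init__(self, capacity, refill_rate):
        self.capacity = capacity        # maximum burst size, in tokens
        self.refill_rate = refill_rate  # tokens added per second
        self.tokens = float(capacity)   # assumption: the bucket starts full
        self.last = 0.0                 # time of the most recent refill

    def allow(self, now, cost=1.0):
        """Refill for the time elapsed since the last call, then try to spend `cost` tokens."""
        elapsed = max(0.0, now - self.last)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_rate)
        self.last = max(self.last, now)
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False

bucket = TokenBucket(capacity=2, refill_rate=1.0)
decisions = [bucket.allow(t) for t in (0.0, 0.1, 0.2, 1.5)]
print(decisions)  # [True, True, False, True]
```

The first two requests spend the initial burst, the third arrives before a full token has been refilled, and the fourth succeeds after waiting.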

Module Items

  • SLO Metrics: P99 & Burn Rate

    Compute p99 latency and error budget burn rate.

    medium
  • Backpressure Queue Simulation

    Simulate bounded queue backpressure and shedding.

    medium
  • Token Bucket Rate Limiter

    Simulate token bucket rate limiting decisions.

    medium
  • Retry Backoff Schedule

    Schedule retries with exponential backoff.

    medium
  • Circuit Breaker Simulation

    Simulate circuit breaker state transitions.

    medium
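
For a feel of what the circuit breaker state transitions involve, here is a minimal sketch of the usual three-state model (closed, open, half-open). The thresholds, method names, and explicit `now` argument are assumptions made for illustration; the actual problem may specify different rules.

```python
class CircuitBreaker:
    """Three-state breaker: closed -> open -> half_open -> closed (or back to open)."""

    def __init__(self, failure_threshold=3, cooldown=10.0):
        self.failure_threshold = failure_threshold  # consecutive failures before opening
        self.cooldown = cooldown                    # seconds to stay open before probing
        self.state = "closed"
        self.failures = 0
        self.opened_at = None

    def allow(self, now):
        """Return True if a call may proceed at time `now` (seconds)."""
        if self.state == "open":
            if now - self.opened_at >= self.cooldown:
                # Cooldown elapsed: allow probe calls until an outcome is recorded.
                self.state = "half_open"
                return True
            return False  # fail fast while open
        return True

    def record(self, success, now):
        """Update the state with the outcome of a call that was allowed."""
        if success:
            self.state = "closed"
            self.failures = 0
            return
        self.failures += 1
        # A failed probe, or too many consecutive failures, opens the breaker.
        if self.state == "half_open" or self.failures >= self.failure_threshold:
            self.state = "open"
            self.opened_at = now

cb = CircuitBreaker(failure_threshold=2, cooldown=5.0)
for t in (0.0, 1.0):
    if cb.allow(t):
        cb.record(success=False, now=t)
print(cb.state)       # open
print(cb.allow(3.0))  # False
print(cb.allow(6.0))  # True
cb.record(success=True, now=6.0)
print(cb.state)       # closed
```

While the breaker is open, callers fail immediately instead of waiting on an unhealthy dependency, which is what keeps one failure from cascading upstream.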