Observability & Resilience

Lesson, slides, and applied problem sets.

Lesson

Why it matters

Real systems fail and slow down. You need metrics to detect when that happens and controls to keep the system serving while it does.

SLOs and tail latency

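SLOs (service level objectives) define what "good" looks like as a measurable target
Tail latency (p99/p999) drives user experience more than the average does
Error budgets set how much unreliability the SLO allows; the burn rate tells you how fast you're spending it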
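
To make these terms concrete, here is a minimal Python sketch of a tail-latency percentile and a burn-rate calculation. The function names, the nearest-rank percentile method, and the sample numbers are assumptions chosen for illustration, not something this module prescribes.

```python
import math


def percentile(samples, p):
    """Nearest-rank percentile: the smallest sample such that at least
    p percent of all samples are less than or equal to it."""
    if not samples:
        raise ValueError("percentile() needs at least one sample")
    ordered = sorted(samples)
    # Rank is 1-based; clamp to 1 so that p = 0 returns the minimum.
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


def burn_rate(errors, total, slo_target):
    """Observed error rate divided by the error rate the SLO allows.

    A value of 1.0 means the error budget is being spent exactly at the
    pace the SLO permits; values above 1.0 mean it will run out early.
    """
    if total == 0:
        return 0.0
    allowed_error_rate = 1.0 - slo_target  # e.g. 0.001 for a 99.9% SLO
    return (errors / total) / allowed_error_rate


# Hypothetical latencies in milliseconds: mostly fast, with a slow tail.
latencies_ms = [20] * 95 + [400, 450, 500, 900, 1500]
p50 = percentile(latencies_ms, 50)
p99 = percentile(latencies_ms, 99)
rate = burn_rate(errors=5, total=1000, slo_target=0.999)
```

With this hypothetical sample the median stays at the fast value while p99 lands in the slow tail, which is why the tail is tracked separately from the average. The burn rate comes out at roughly 5, meaning the budget is being spent about five times faster than the SLO allows.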

Resilience controls

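Retries can amplify load on a service that is already struggling
Circuit breakers stop cascades by failing fast when a dependency is unhealthy
Backpressure & load shedding protect the core by slowing or rejecting excess work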
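
The controls above can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions: the class names, thresholds, and the choice to pass the current time in as an argument (so the simulation is deterministic) are all hypothetical, and the half-open state here lets every call through rather than a single probe.

```python
from collections import deque


class CircuitBreaker:
    """Three-state breaker: closed -> open -> half_open -> closed."""

    def __init__(self, failure_threshold=3, reset_timeout=10.0):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout  # seconds to stay open
        self.failures = 0                   # consecutive failures seen
        self.state = "closed"
        self.opened_at = None

    def allow(self, now):
        """Return True if a call may be attempted at time `now`."""
        if self.state == "open":
            if now - self.opened_at >= self.reset_timeout:
                self.state = "half_open"    # let trial traffic through
                return True
            return False                    # fail fast, spare the dependency
        return True

    def record(self, success, now):
        """Report the outcome of a call that was allowed."""
        if success:
            self.failures = 0
            self.state = "closed"
            return
        self.failures += 1
        # A failure during the trial period, or too many in a row, opens it.
        if self.state == "half_open" or self.failures >= self.failure_threshold:
            self.state = "open"
            self.opened_at = now


class BoundedQueue:
    """Fixed-capacity queue that sheds new work when full."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.items = deque()
        self.shed = 0  # count of rejected items

    def offer(self, item):
        """Enqueue if there is room; otherwise reject and count the shed."""
        if len(self.items) >= self.capacity:
            self.shed += 1
            return False
        self.items.append(item)
        return True

    def poll(self):
        """Dequeue the oldest item, or return None if empty."""
        return self.items.popleft() if self.items else None
```

A rejected `offer` is the backpressure signal: the caller learns immediately that the system is saturated and can slow down or drop the request, instead of letting an unbounded queue grow until latency and memory blow up.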

What you will build

P99 + burn-rate metrics
Circuit breaker simulation
Backpressure queue simulation
Token bucket rate limiter
Retry backoff scheduling
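
As a preview of the last two items, here is a rough Python sketch of a token bucket and a capped exponential backoff schedule. The names, default parameters, and the optional "full jitter" variant are assumptions for illustration; the exercises in this module may define them differently.

```python
import random


class TokenBucket:
    """Rate limiter: tokens refill at `rate` per second up to `capacity`."""

    def __init__(self, rate, capacity, now=0.0):
        self.rate = rate
        self.capacity = capacity
        self.tokens = float(capacity)  # start full, so a burst is allowed
        self.last = now                # time of the last refill

    def allow(self, now, cost=1.0):
        """Refill for the elapsed time, then spend `cost` tokens if possible."""
        elapsed = max(0.0, now - self.last)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.last = max(self.last, now)
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False


def backoff_schedule(attempts, base=0.1, factor=2.0, cap=5.0, rng=None):
    """Delays (in seconds) to wait before each retry attempt.

    Without `rng`, the delay is min(cap, base * factor**attempt).
    With `rng`, each delay is drawn uniformly from [0, that ceiling],
    which spreads retries out so clients do not retry in lockstep.
    """
    delays = []
    for attempt in range(attempts):
        ceiling = min(cap, base * factor ** attempt)
        delays.append(rng.uniform(0.0, ceiling) if rng else ceiling)
    return delays


# Deterministic schedule, then a jittered one with a seeded generator.
plain = backoff_schedule(5, base=0.1, factor=2.0, cap=1.0)
jittered = backoff_schedule(5, base=0.1, factor=2.0, cap=1.0,
                            rng=random.Random(0))
```

The bucket's capacity bounds the burst size while the rate bounds sustained throughput. The cap on the backoff keeps late retries from waiting unreasonably long, and the jitter addresses the retry amplification problem by preventing many clients from retrying at the same instant.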

Module Items