Observability & Resilience
Lesson, slides, and applied problem sets.
Why it matters
Real systems fail and slow down. You need metrics to detect failures and slowdowns, and controls to keep the system alive when they happen.
SLOs and tail latency
- SLOs define what "good" looks like
- Tail latency (p99/p99.9) drives user experience
- Error budgets define how much unreliability you can afford; burn rate tells you how fast you're spending it
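As a rough illustration of the two metrics above, here is a minimal Python sketch. It assumes a nearest-rank definition of the percentile and defines burn rate as the observed error rate divided by the error rate the SLO allows. The names `percentile` and `burn_rate` and the sample data are hypothetical and not taken from the problem sets.

```python
import math


def percentile(samples, p):
    """Nearest-rank percentile: the smallest sample with at least p% of samples at or below it."""
    if not samples:
        raise ValueError("samples must be non-empty")
    ordered = sorted(samples)
    rank = math.ceil(p * len(ordered) / 100)  # 1-based rank
    return ordered[max(rank, 1) - 1]


def burn_rate(errors, total, slo_target):
    """Observed error rate divided by the error rate the SLO allows.

    A value of 1.0 means the error budget is being spent exactly on schedule;
    values above 1.0 mean it will run out before the SLO window ends.
    """
    if total == 0:
        return 0.0
    allowed = 1.0 - slo_target  # e.g. 0.001 for a 99.9% SLO
    return (errors / total) / allowed


# Hypothetical data: 100 latency samples of 1..100 ms.
latencies_ms = list(range(1, 101))
print(percentile(latencies_ms, 99))  # 99

# 5 errors in 1000 requests against a 99.9% SLO: about 5x the allowed rate.
print(burn_rate(errors=5, total=1000, slo_target=0.999))
```

Other percentile definitions (for example, linear interpolation between ranks) give slightly different values on small samples, so the exact definition expected by an exercise should be checked before reusing this.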
Resilience controls
- Retries can amplify load unless they are bounded and backed off
- Circuit breakers stop cascades
- Backpressure & load shedding protect the core
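To make the circuit breaker idea concrete, the sketch below simulates a three-state breaker (closed, open, half-open) in Python. It is one common design, not necessarily the one the module uses. The class name, the `failure_threshold` and `cooldown` parameters, and the explicit `now` timestamps (passed in so the simulation is deterministic) are all assumptions.

```python
class CircuitBreaker:
    """Three-state breaker: closed -> open -> half_open -> closed."""

    def __init__(self, failure_threshold=3, cooldown=5.0):
        self.failure_threshold = failure_threshold  # consecutive failures before opening
        self.cooldown = cooldown                    # seconds to stay open before probing
        self.state = "closed"
        self.failures = 0
        self.opened_at = None

    def allow(self, now):
        """Return True if a call may go through at time `now`."""
        if self.state == "open":
            if now - self.opened_at >= self.cooldown:
                self.state = "half_open"  # cooldown elapsed: allow a probe call
                return True
            return False
        return True

    def record(self, success, now):
        """Update the breaker with the outcome of a call that was allowed."""
        if success:
            self.state = "closed"
            self.failures = 0
            return
        self.failures += 1
        # A failed probe, or too many consecutive failures, (re)opens the breaker.
        if self.state == "half_open" or self.failures >= self.failure_threshold:
            self.state = "open"
            self.opened_at = now


def simulate(calls, breaker):
    """calls: list of (time, would_succeed). Returns one result string per call."""
    results = []
    for now, ok in calls:
        if not breaker.allow(now):
            results.append("rejected")  # fail fast without touching the dependency
            continue
        breaker.record(ok, now)
        results.append("ok" if ok else "failed")
    return results


calls = [(0, False), (1, False), (2, False), (3, True), (4, True), (8, True), (9, True)]
print(simulate(calls, CircuitBreaker(failure_threshold=3, cooldown=5.0)))
# ['failed', 'failed', 'failed', 'rejected', 'rejected', 'ok', 'ok']
```

The calls at t=3 and t=4 would have succeeded but are rejected because the breaker is still open. That is the trade-off: the breaker gives up some successful calls in exchange for shielding a struggling dependency from further load.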
What you will build
- P99 + burn-rate metrics
- Circuit breaker simulation
- Backpressure queue simulation
- Token bucket rate limiter
- Retry backoff scheduling
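For a sense of scale, the last two items in the list can each be sketched in a few lines of Python. This is a minimal illustration under stated assumptions. `TokenBucket`, `backoff_schedule`, and their parameters are hypothetical names, time is passed in explicitly instead of read from a clock, and the backoff is plain capped exponential with no jitter so the output is deterministic.

```python
class TokenBucket:
    """Rate limiter: tokens refill at `rate` per second, up to `capacity`."""

    def __init__(self, rate, capacity, now=0.0):
        self.rate = rate                # tokens added per second
        self.capacity = capacity        # maximum burst size
        self.tokens = float(capacity)   # start with a full bucket
        self.last = now                 # time of the last refill

    def allow(self, now, cost=1.0):
        """Refill for the elapsed time, then spend `cost` tokens if available."""
        elapsed = now - self.last
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.last = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False


def backoff_schedule(base_ms, factor, cap_ms, attempts):
    """Delay before each retry: base_ms * factor**n, capped at cap_ms."""
    return [min(cap_ms, base_ms * factor ** n) for n in range(attempts)]


# One token per second with a burst of two: the third call at t=0 is limited.
bucket = TokenBucket(rate=1.0, capacity=2)
print([bucket.allow(t) for t in (0.0, 0.0, 0.0, 1.0, 1.0)])
# [True, True, False, True, False]

print(backoff_schedule(base_ms=100, factor=2, cap_ms=1000, attempts=5))
# [100, 200, 400, 800, 1000]
```

Production retry code usually adds random jitter to each delay so that many clients do not retry at the same instant. It is left out here only to keep the schedule reproducible.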
Module Items
SLO Metrics: P99 & Burn Rate
Backpressure Queue Simulation
Token Bucket Rate Limiter
Retry Backoff Schedule
Circuit Breaker Simulation