Observability & Resilience
Lesson, slides, and applied problem sets.
Why it matters
Real systems fail and slow down. You need metrics to detect failures and slowdowns, and controls to keep the system available while they are happening.
SLOs and tail latency
- SLOs define what "good" looks like
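- Tail latency (p99, p99.9) drives user experience: the slowest requests are the ones users notice
- Error budgets state how much unreliability the SLO allows; the burn rate tells you how fast you are spending it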
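As a rough illustration of the two metrics above, here is a minimal sketch. It assumes the nearest-rank definition of a percentile (other definitions interpolate between samples) and defines burn rate as the observed error rate divided by the error rate the SLO permits. The function names are hypothetical and chosen only for this example.

```python
import math


def percentile_nearest_rank(samples, pct):
    """Return the pct-th percentile of samples using the nearest-rank method."""
    if not samples:
        raise ValueError("samples must be non-empty")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in (0, 100]")
    ordered = sorted(samples)
    # 1-based rank of the smallest value that covers pct percent of samples.
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[rank - 1]


def burn_rate(errors, total, slo_target):
    """Observed error rate divided by the error rate the SLO allows.

    A value of 1.0 means the budget is being spent exactly at the rate
    that would exhaust it at the end of the SLO window.
    """
    if total <= 0:
        raise ValueError("total must be positive")
    budget = 1.0 - slo_target  # allowed fraction of failed requests
    if budget <= 0:
        raise ValueError("slo_target must be below 1.0")
    return (errors / total) / budget


latencies_ms = list(range(1, 101))  # illustrative data: 1 ms .. 100 ms
p99 = percentile_nearest_rank(latencies_ms, 99)
rate = burn_rate(errors=5, total=1000, slo_target=0.999)
```

With these illustrative inputs, p99 is 99 ms, and the burn rate is about 5: a 0.5% error rate against a 0.1% budget spends the budget roughly five times faster than the SLO allows.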
Resilience controls
- Retries recover from transient failures but can amplify load on a service that is already struggling
- Circuit breakers stop cascades
- Backpressure & load shedding protect the core
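To make the retry amplification point concrete, here is a small worked calculation. It assumes a chain of services in which every layer independently makes up to the same number of attempts, and that every attempt fails, which is the worst case. The function name is hypothetical.

```python
def worst_case_amplification(attempts_per_layer, layers):
    """Requests reaching the deepest service for one user request.

    Assumes each layer makes up to attempts_per_layer attempts
    (the first try plus retries) and that every attempt fails.
    """
    if attempts_per_layer < 1 or layers < 0:
        raise ValueError("need attempts_per_layer >= 1 and layers >= 0")
    return attempts_per_layer ** layers


# Three layers, each trying up to 3 times: 3 * 3 * 3 requests at the bottom.
fan_out = worst_case_amplification(attempts_per_layer=3, layers=3)
```

Here one user request can turn into 27 requests against the deepest dependency, which is why retries are usually paired with backoff, retry limits, and circuit breakers.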
What you will build
- P99 + burn-rate metrics
- Circuit breaker simulation
- Backpressure queue simulation
- Token bucket rate limiter
- Retry backoff scheduling
Module Items
SLO Metrics: P99 & Burn Rate
Compute p99 latency and error budget burn rate.
Backpressure Queue Simulation
Simulate bounded queue backpressure and shedding.
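One possible shape for such a simulation is sketched below. It assumes discrete ticks, a fixed service rate per tick, and a shed-on-full policy where arrivals that find the queue full are dropped. These modeling choices and all names are assumptions made for illustration, not the problem set's specification.

```python
from collections import deque


def simulate_bounded_queue(arrivals_per_tick, capacity, service_per_tick):
    """Simulate a bounded FIFO queue that sheds arrivals when it is full.

    arrivals_per_tick: list with the number of requests arriving each tick.
    capacity: maximum number of queued requests.
    service_per_tick: requests the server can complete per tick.
    """
    queue = deque()
    accepted = shed = served = 0
    next_id = 0
    for arrivals in arrivals_per_tick:
        # Admission: enqueue while there is room, shed the rest.
        for _ in range(arrivals):
            if len(queue) < capacity:
                queue.append(next_id)
                accepted += 1
            else:
                shed += 1
            next_id += 1
        # Service: drain up to service_per_tick requests in FIFO order.
        for _ in range(min(service_per_tick, len(queue))):
            queue.popleft()
            served += 1
    return {"accepted": accepted, "shed": shed,
            "served": served, "queued": len(queue)}


stats = simulate_bounded_queue([5, 5, 0], capacity=4, service_per_tick=2)
```

Bounding the queue keeps waiting time bounded too: a request that would have waited behind an unbounded backlog is rejected immediately instead, so the caller can back off or fail fast.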
Token Bucket Rate Limiter
Simulate token bucket rate limiting decisions.
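A minimal sketch of token bucket decisions follows. It assumes the bucket starts full, refills continuously at a fixed rate, each request costs one token, and request timestamps are non-decreasing. The function and parameter names are hypothetical.

```python
def token_bucket_decisions(request_times, rate, capacity):
    """Return a list of allow/deny decisions, one per request.

    request_times: non-decreasing timestamps in seconds.
    rate: tokens added per second.
    capacity: maximum tokens the bucket can hold (the burst size).
    """
    tokens = float(capacity)  # assumption: the bucket starts full
    last = None
    decisions = []
    for now in request_times:
        if last is not None:
            # Refill for the elapsed time, never exceeding capacity.
            tokens = min(float(capacity), tokens + (now - last) * rate)
        last = now
        if tokens >= 1.0:
            tokens -= 1.0
            decisions.append(True)
        else:
            decisions.append(False)
    return decisions


decisions = token_bucket_decisions([0, 0, 0, 1, 1], rate=1, capacity=2)
```

With a capacity of 2 and a rate of 1 token per second, the first two requests at t=0 are allowed as a burst, the third is denied, and after one second a single new token admits one more request: `[True, True, False, True, False]`.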
Retry Backoff Schedule
Schedule retries with exponential backoff.
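The schedule can be sketched as follows, assuming a base delay that is multiplied by a constant factor on each retry and capped at a maximum. The optional "full jitter" variant, which picks a random delay between zero and the capped value, is an addition for illustration and not something the exercise necessarily requires. All names are hypothetical.

```python
import random


def backoff_schedule(base_delay, factor, max_delay, max_retries):
    """Delays before each retry: base * factor**attempt, capped at max_delay."""
    return [min(max_delay, base_delay * factor ** attempt)
            for attempt in range(max_retries)]


def with_full_jitter(delays, seed=None):
    """Replace each delay d with a random value in [0, d].

    Jitter spreads out retries so that many clients that failed
    together do not all retry at the same instant.
    """
    rng = random.Random(seed)
    return [rng.uniform(0.0, d) for d in delays]


delays = backoff_schedule(base_delay=1, factor=2, max_delay=10, max_retries=5)
jittered = with_full_jitter(delays, seed=0)
```

For a base delay of 1 second, a factor of 2, and a cap of 10 seconds, the un-jittered schedule is `[1, 2, 4, 8, 10]`: the fifth delay would be 16 seconds but is capped.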
Circuit Breaker Simulation
Simulate circuit breaker state transitions.
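The state transitions can be sketched with the common three-state model: CLOSED (calls pass), OPEN (calls are rejected), and HALF_OPEN (a probe call is allowed after a timeout). This sketch assumes the breaker opens after a fixed number of consecutive failures and that a single successful probe closes it again. The class, method, and parameter names are hypothetical.

```python
class CircuitBreaker:
    def __init__(self, failure_threshold, recovery_timeout):
        self.failure_threshold = failure_threshold  # consecutive failures to open
        self.recovery_timeout = recovery_timeout    # time to wait before probing
        self.state = "CLOSED"
        self.failures = 0
        self.opened_at = None

    def allow(self, now):
        """Decide whether a call may proceed at time now."""
        if (self.state == "OPEN"
                and now - self.opened_at >= self.recovery_timeout):
            self.state = "HALF_OPEN"  # timeout elapsed: let a probe through
        return self.state != "OPEN"

    def record(self, now, success):
        """Update the state with the outcome of an allowed call."""
        if success:
            self.state = "CLOSED"
            self.failures = 0
            return
        self.failures += 1
        # A failed probe reopens immediately; otherwise open at the threshold.
        if self.state == "HALF_OPEN" or self.failures >= self.failure_threshold:
            self.state = "OPEN"
            self.opened_at = now


def simulate(events, failure_threshold, recovery_timeout):
    """events: (time, would_succeed) pairs. Returns (allowed, state) per event."""
    breaker = CircuitBreaker(failure_threshold, recovery_timeout)
    trace = []
    for now, would_succeed in events:
        allowed = breaker.allow(now)
        if allowed:
            breaker.record(now, would_succeed)
        trace.append((allowed, breaker.state))
    return trace


trace = simulate([(0, False), (1, False), (2, True), (6, False), (11, True)],
                 failure_threshold=2, recovery_timeout=5)
```

In this run the second failure opens the breaker at t=1, the call at t=2 is rejected without reaching the dependency, the probe at t=6 fails and reopens the breaker, and the probe at t=11 succeeds and closes it.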