Reliability Patterns: Circuit Breakers, Retries, Fallbacks, and Graceful Degradation
Learn how to build reliable systems that handle failures gracefully. This guide covers circuit breakers, retry strategies, timeouts, bulkheads, and graceful degradation patterns.
Why Reliability Matters
Every system fails eventually. The question is whether your system degrades gracefully or collapses completely. Reliability patterns help you build systems that survive failures.
Key insight: Reliability ≠ availability. A system can be available (it answers requests) yet unreliable (the answers are errors). Aim for both: high availability and reliable responses.
Failure Modes
Types of Failures
| Type | Description | Example |
|---|---|---|
| Transient | Temporary, retry might work | Network timeout |
| Permanent | Won't recover without intervention | Disk full |
| Byzantine | Arbitrary or inconsistent behavior, possibly malicious | Node returns corrupted data |
| Partial | Some requests work | Single server down |
Timeouts
Always set timeouts. Without them, a slow service can cause your entire system to hang.
Timeout Strategy
Setting Timeouts
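With httpx, for example, you can set a 5-second default timeout together with a 1-second connection timeout. The default applies to each phase of the request (connect, read, write, pool acquisition) rather than to the request as a whole. Setting timeouts on every call prevents requests from waiting indefinitely when a service fails.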
| Service Type | Timeout Recommendation |
|---|---|
| Fast cache (Redis) | 10-50ms |
| Database | 100-500ms |
| External API | 1-5s |
| Long-running job | Async + polling |
Retry Patterns
When to Retry
Exponential Backoff
Exponential backoff doubles the wait between retries (1s, 2s, 4s, ...) and adds random jitter to each wait. Without jitter, many clients that failed at the same moment would all retry at the same moment (the thundering herd problem); the randomness spreads their retries out.
| Attempt | Base Delay | With Jitter (up to +50%) |
|---|---|---|
| 1 | 1s | 1.0-1.5s |
| 2 | 2s | 2.0-3.0s |
| 3 | 4s | 4.0-6.0s |
| 4 | 8s | 8.0-12.0s |
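The delays in the table can be reproduced with a few lines of Python. This is a sketch under stated assumptions: the function names, the 50% jitter ceiling (read off the table), and the default set of retryable exceptions are all illustrative.

```python
import random
import time


def backoff_delay(attempt: int, base: float = 1.0, jitter: float = 0.5) -> float:
    """Delay after failed attempt number `attempt` (1-based).

    The base delay doubles each attempt and is then stretched by a random
    factor between 0 and `jitter` (50% by default).
    """
    exponential = base * 2 ** (attempt - 1)
    return exponential * (1 + random.uniform(0, jitter))


def retry_with_backoff(func, max_attempts=4,
                       retry_on=(TimeoutError, ConnectionError),
                       sleep=time.sleep):
    """Call func(); on a retryable exception, wait with backoff and retry.

    Re-raises the last exception once max_attempts calls have failed.
    Exceptions not listed in `retry_on` propagate immediately.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return func()
        except retry_on:
            if attempt == max_attempts:
                raise
            sleep(backoff_delay(attempt))
```

The `sleep` parameter is injectable so the retry loop can be exercised in tests without actually waiting.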
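Don't retry blindly: Only retry transient errors (timeouts, 5xx). As a rule, don't retry 4xx errors, because they are client mistakes that will fail the same way again; 408 (Request Timeout) and 429 (Too Many Requests) are the usual exceptions. Implement retry budgets so that retries cannot multiply the load on a service that is already struggling.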
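Both rules can be sketched in a few lines. Treating 408 and 429 as retryable is a common convention and an assumption here, and `RetryBudget` is a hypothetical, deliberately simple class (it counts over the process lifetime rather than a sliding window):

```python
# Assumption: these two 4xx codes signal a condition that may clear on its own.
RETRYABLE_4XX = {408, 429}


def is_retryable(status_code: int) -> bool:
    """True for 5xx responses and the retryable 4xx codes listed above."""
    return 500 <= status_code <= 599 or status_code in RETRYABLE_4XX


class RetryBudget:
    """Allow retries only while they stay under `ratio` of all requests seen."""

    def __init__(self, ratio: float = 0.1):
        self.ratio = ratio
        self.requests = 0
        self.retries = 0

    def record_request(self) -> None:
        """Count one original (non-retry) request."""
        self.requests += 1

    def can_retry(self) -> bool:
        """Spend one unit of budget if any is left; otherwise refuse."""
        if self.retries + 1 > self.requests * self.ratio:
            return False
        self.retries += 1
        return True
```

A caller would check `is_retryable(response.status_code) and budget.can_retry()` before scheduling another attempt, so that a widespread outage produces a bounded amount of extra traffic.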
Circuit Breaker Pattern
Prevent cascading failures by stopping calls to a failing service.
States
Implementation
The circuit breaker tracks failures and opens the circuit when a threshold is reached. In the closed state (normal), it executes the function. In the open state, it rejects calls immediately. After a recovery timeout, it moves to the half-open state and lets a trial call through: if the call succeeds the circuit closes again, and if it fails the circuit reopens.
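A minimal sketch of those three states, assuming a consecutive-failure threshold and a single trial call in the half-open state. The class and parameter names are illustrative, and the sketch is not thread-safe:

```python
import time


class CircuitOpenError(Exception):
    """Raised when a call is rejected because the circuit is open."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, recovery_timeout=30.0,
                 clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.recovery_timeout = recovery_timeout
        self.clock = clock          # injectable so tests can control time
        self.state = "closed"
        self.failures = 0
        self.opened_at = 0.0

    def call(self, func, *args, **kwargs):
        if self.state == "open":
            if self.clock() - self.opened_at < self.recovery_timeout:
                raise CircuitOpenError("circuit is open")
            self.state = "half_open"  # timeout elapsed: allow a trial call
        try:
            result = func(*args, **kwargs)
        except Exception:
            self._on_failure()
            raise
        self._on_success()
        return result

    def _on_failure(self):
        self.failures += 1
        # A failed trial call, or too many failures, (re)opens the circuit.
        if self.state == "half_open" or self.failures >= self.failure_threshold:
            self.state = "open"
            self.opened_at = self.clock()

    def _on_success(self):
        self.failures = 0
        self.state = "closed"
```

Usage is a matter of routing calls through the breaker, for example `breaker.call(fetch_profile, user_id)` where `fetch_profile` stands in for any remote call; callers catch `CircuitOpenError` and go straight to a fallback.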
Circuit Breaker in Action
Bulkhead Pattern
Isolate components so that failure in one doesn't affect others.
| Pattern | Purpose |
|---|---|
| Connection pool per service | One slow DB doesn't block others |
| Separate thread pools | Isolate computation |
| Dedicated instances | Complete isolation |
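One lightweight way to get the isolation in the table is a semaphore per dependency that caps concurrent calls. The following is a sketch, not a full thread-pool bulkhead; `Bulkhead` and `BulkheadFullError` are hypothetical names:

```python
import threading


class BulkheadFullError(Exception):
    """Raised when the compartment has no free slots."""


class Bulkhead:
    """Cap the number of concurrent calls into one dependency."""

    def __init__(self, max_concurrent: int):
        self._slots = threading.BoundedSemaphore(max_concurrent)

    def call(self, func, *args, **kwargs):
        # Fail fast instead of queueing when the compartment is full, so a
        # slow dependency cannot tie up every worker in the process.
        if not self._slots.acquire(blocking=False):
            raise BulkheadFullError("bulkhead is full")
        try:
            return func(*args, **kwargs)
        finally:
            self._slots.release()
```

Giving each downstream service its own `Bulkhead` instance means that a slow service can exhaust only its own slots; calls to other services keep their capacity.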
Fallback and Graceful Degradation
Fallback Strategies
Code Example
    # ml_service, cache, popular_products and MLServiceError are assumed to be
    # defined elsewhere in the application.
    def get_product_recommendations(user_id):
        try:
            # Try the primary service first
            return ml_service.get_recommendations(user_id)
        except MLServiceError:
            # Fallback 1: cached recommendations for this user
            cached = cache.get(f"recs:{user_id}")
            if cached:
                return cached
            # Fallback 2: popular products (not personalized)
            return popular_products()
Graceful Degradation Matrix
| Component Down | Degradation |
|---|---|
| Recommendations | Show popular items |
| Reviews | Hide review section |
| Search | Show cached results |
| Payments | Queue for later |
| Analytics | Batch process later |
Health Checks
Health checks tell orchestration systems when to route traffic to a service and when to restart it.
Health Check Levels
| Level | Checks | Use Case |
|---|---|---|
| Liveness | Is the process alive? | Kubernetes liveness probe |
| Readiness | Can handle traffic? | Kubernetes readiness probe |
| Startup | Is initialization complete? | Kubernetes startup probe |
Health Check Implementation
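    from fastapi import FastAPI
    from fastapi.responses import JSONResponse

    app = FastAPI()

    @app.get("/health")
    async def health():
        # check_database, check_cache and check_external_api are assumed to be
        # defined elsewhere and to return True when the dependency is reachable.
        checks = {
            "database": check_database(),
            "cache": check_cache(),
            "external_api": check_external_api(),
        }
        all_healthy = all(checks.values())
        # FastAPI does not interpret a (body, status) tuple as a status code,
        # so set the status explicitly on the response.
        return JSONResponse(
            content={
                "status": "healthy" if all_healthy else "unhealthy",
                "checks": checks,
            },
            status_code=200 if all_healthy else 503,
        )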
Chaos Engineering
Test your reliability by deliberately breaking things.
| Chaos Tool | What It Tests |
|---|---|
| Chaos Monkey | Random instance termination |
| Litmus | Kubernetes chaos |
| Gremlin | Targeted attacks |
| AWS Fault Injection Simulator | Cloud infrastructure |
Start small: Test in staging first. Have rollback plans. Start with non-critical services. Gradually move to production with proper monitoring.
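Before reaching for a dedicated tool, the same idea can be tried at the code level by wrapping a dependency so that a fraction of calls fail on purpose. This is a toy sketch for a staging environment; `with_chaos` and `InjectedFault` are hypothetical names, not part of any tool in the table:

```python
import random


class InjectedFault(Exception):
    """Marks failures that were injected on purpose."""


def with_chaos(func, failure_rate: float, rng=random.random):
    """Wrap func so that roughly `failure_rate` of calls raise InjectedFault.

    `rng` must return a float in [0, 1); it is injectable so experiments
    can be made reproducible.
    """
    def wrapper(*args, **kwargs):
        if rng() < failure_rate:
            raise InjectedFault("injected failure")
        return func(*args, **kwargs)
    return wrapper
```

Wrapping a client call with, say, `failure_rate=0.05` shows whether the surrounding retries, circuit breakers, and fallbacks behave as intended when the dependency misbehaves.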
Reliability Checklist
| Pattern | When to Use |
|---|---|
| Timeouts | Always - prevent infinite waits |
| Retries | Transient failures, idempotent operations |
| Circuit Breaker | Calls to external services that might fail |
| Bulkhead | Isolate critical resources |
| Fallback | Degrade gracefully when primary fails |
| Health Checks | Help orchestrators manage your service |
| Chaos Engineering | Test your reliability assumptions |
What to Remember for Interviews
- Timeouts are essential: Always set them; infinite waits are never acceptable
- Retry with backoff: Exponential backoff + jitter prevents thundering herd
- Circuit breakers: Protect against cascading failures
- Graceful degradation: Have fallback strategies for every critical path
- Test for failure: Use chaos engineering to verify reliability
Practice: For any system design, ask: "What happens if X fails?" Then implement the appropriate reliability pattern.