Security & Reliability

Reliability Patterns: Circuit Breakers, Retries, Fallbacks, and Graceful Degradation

Learn how to build reliable systems that handle failures gracefully. Covers circuit breakers, retry strategies, timeouts, bulkheads, and graceful degradation patterns.

reliability, circuit breaker, retry, fallback, resilience, fault tolerance

Why Reliability Matters

Every system fails eventually. The question is whether your system degrades gracefully or collapses completely. Reliability patterns help you build systems that survive failures.

Key insight: Reliability ≠ availability. A system can be available (it accepts requests) but unreliable (it answers them with errors). Aim for both: high availability and reliable responses.


Failure Modes

Types of Failures

Type | Description | Example
Transient | Temporary, retry might work | Network timeout
Permanent | Won't recover without intervention | Disk full
Byzantine | Unpredictable, malicious | Corrupted data
Partial | Some requests work | Single server down

Timeouts

Always set timeouts. Without them, a slow service can cause your entire system to hang.

Timeout Strategy

Setting Timeouts

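With httpx, for example, you can set a 5-second default timeout and a 1-second connection timeout. Note that httpx applies the default to each phase of a request (connect, read, write, pool acquisition) rather than as one overall deadline. Setting timeouts on every call prevents requests from waiting indefinitely when a service fails.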

Service Type | Timeout Recommendation
Fast cache (Redis) | 10-50ms
Database | 100-500ms
External API | 1-5s
Long-running job | Async + polling

Retry Patterns

When to Retry

Exponential Backoff

Exponential backoff doubles the wait between retries (1s, 2s, 4s, ...) and adds random jitter. The jitter spreads retries out over time, so many clients that failed together do not retry in lockstep and create a thundering herd.

Attempt | Base Delay | With Jitter
1 | 1s | 1.0-1.5s
2 | 2s | 2.0-3.0s
3 | 4s | 4.0-6.0s
4 | 8s | 8.0-12.0s
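The delays in the table can be sketched as follows. This is an illustrative implementation, not a specific library's API: the function names, the 50% jitter factor (chosen to match the table), and the default set of retryable exceptions are all assumptions.

```python
import random
import time


def backoff_delay(attempt: int, base: float = 1.0, jitter: float = 0.5) -> float:
    """Delay before retry `attempt` (1-based): base * 2^(attempt-1) plus up to 50% jitter."""
    delay = base * 2 ** (attempt - 1)
    return delay + random.uniform(0, delay * jitter)


def retry_with_backoff(func, max_attempts: int = 4, base: float = 1.0,
                       retry_on=(TimeoutError, ConnectionError), sleep=time.sleep):
    """Call func(), retrying only the listed transient errors with exponential backoff."""
    for attempt in range(1, max_attempts + 1):
        try:
            return func()
        except retry_on:
            if attempt == max_attempts:
                raise  # out of attempts: let the caller handle the error
            sleep(backoff_delay(attempt, base))
```

Exceptions that are not listed in `retry_on` propagate immediately, so permanent errors are never retried. The `sleep` parameter exists so the wait can be replaced in tests.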
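⚠️ Don't retry blindly: Only retry transient errors (timeouts, 5xx responses). Don't retry 4xx errors, which signal a problem with the request itself; the usual exceptions are 408 (Request Timeout) and 429 (Too Many Requests), which can be retried after a delay. Only retry operations that are idempotent, and use a retry budget so retries cannot multiply the load on a service that is already struggling.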
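One way to picture a retry budget is a token bucket in which normal requests earn retry tokens and each retry spends one. The sketch below is a simplified, single-threaded illustration; the class name, the 10% ratio, and the token cap are assumptions rather than values from any particular library.

```python
class RetryBudget:
    """Allow retries only while they stay below a fraction of normal traffic."""

    def __init__(self, ratio: float = 0.1, max_tokens: float = 10.0):
        self.ratio = ratio            # tokens earned per regular request
        self.max_tokens = max_tokens  # cap so idle periods cannot bank unlimited retries
        self.tokens = max_tokens

    def record_request(self) -> None:
        """Call once per regular (non-retry) request."""
        self.tokens = min(self.max_tokens, self.tokens + self.ratio)

    def try_acquire(self) -> bool:
        """Spend one token for a retry; return False when the budget is exhausted."""
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False
```

A client would check `try_acquire()` before each retry and give up (or fall back) when it returns False, which keeps retry traffic to roughly `ratio` of regular traffic during a sustained outage.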


Circuit Breaker Pattern

Prevent cascading failures by stopping calls to a failing service.

States

Implementation

The circuit breaker tracks failures and opens the circuit when a threshold is reached. In the closed state (normal), it executes the function. In the open state, it rejects calls immediately without contacting the service. After a recovery timeout, it moves to the half-open state and lets a trial call through: if the call succeeds, the circuit closes again; if it fails, the circuit reopens.
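The state machine described above can be sketched in a few lines. This is a minimal, single-threaded illustration; the class and exception names, the default threshold of 5 failures, and the 30-second recovery timeout are assumptions, and production code would normally add locking and metrics.

```python
import time


class CircuitOpenError(Exception):
    """Raised when a call is rejected because the circuit is open."""


class CircuitBreaker:
    def __init__(self, failure_threshold=5, recovery_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.recovery_timeout = recovery_timeout
        self.clock = clock  # injectable so tests can control time
        self.state = "closed"
        self.failures = 0
        self.opened_at = 0.0

    def call(self, func, *args, **kwargs):
        if self.state == "open":
            if self.clock() - self.opened_at < self.recovery_timeout:
                raise CircuitOpenError("circuit is open; call rejected")
            self.state = "half_open"  # timeout elapsed: allow a trial call

        try:
            result = func(*args, **kwargs)
        except Exception:
            self._on_failure()
            raise
        self._on_success()
        return result

    def _on_success(self):
        self.failures = 0
        self.state = "closed"

    def _on_failure(self):
        self.failures += 1
        # A failed trial call, or too many failures while closed, opens the circuit.
        if self.state == "half_open" or self.failures >= self.failure_threshold:
            self.state = "open"
            self.opened_at = self.clock()
```

Usage is a thin wrapper around the risky call, for example `breaker.call(client.get, url)`; callers catch `CircuitOpenError` and go straight to a fallback instead of waiting on a service that is known to be failing.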

Circuit Breaker in Action


Bulkhead Pattern

Isolate components so that failure in one doesn't affect others.

Pattern | Purpose
Connection pool per service | One slow DB doesn't block others
Separate thread pools | Isolate computation
Dedicated instances | Complete isolation
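As a rough illustration of the idea, the sketch below caps the number of concurrent calls per dependency with a semaphore, so a slow dependency can only tie up its own slots. The class name, exception name, and the example limits are assumptions chosen for illustration.

```python
import threading


class BulkheadFullError(Exception):
    """Raised when a compartment has no free slots."""


class Bulkhead:
    """Cap concurrent calls to one dependency so it cannot exhaust shared capacity."""

    def __init__(self, name: str, max_concurrent: int):
        self.name = name
        self._slots = threading.BoundedSemaphore(max_concurrent)

    def call(self, func, *args, **kwargs):
        # Reject immediately instead of queueing behind a slow dependency.
        if not self._slots.acquire(blocking=False):
            raise BulkheadFullError(f"{self.name} bulkhead is full")
        try:
            return func(*args, **kwargs)
        finally:
            self._slots.release()


# One compartment per downstream dependency (names and limits are illustrative).
payments_bulkhead = Bulkhead("payments", max_concurrent=10)
reports_bulkhead = Bulkhead("reports", max_concurrent=2)
```

If every reports slot is busy, further reports calls fail fast with `BulkheadFullError` while payments calls continue to run in their own compartment.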

Fallback and Graceful Degradation

Fallback Strategies

Code Example

python
def get_product_recommendations(user_id):
    """Return recommendations, degrading to cached and then popular items."""
    # ml_service, cache, MLServiceError, and popular_products are assumed
    # to be provided by the surrounding application.
    try:
        # Try primary service
        return ml_service.get_recommendations(user_id)
    except MLServiceError:
        # Fallback 1: Cached recommendations
        cached = cache.get(f"recs:{user_id}")
        if cached:
            return cached

        # Fallback 2: Popular products
        return popular_products()

Graceful Degradation Matrix

Component Down | Degradation
Recommendations | Show popular items
Reviews | Hide review section
Search | Show cached results
Payments | Queue for later
Analytics | Batch process later

Health Checks

Health checks tell orchestration systems when to route traffic to a service and when to restart it.

Health Check Levels

Level | Checks | Use Case
Liveness | Is the process alive? | Kubernetes liveness probe
Readiness | Can handle traffic? | Kubernetes readiness probe
Startup | Is initialization complete? | Kubernetes startup probe

Health Check Implementation

python
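from fastapi import FastAPI
from fastapi.responses import JSONResponse

app = FastAPI()

@app.get("/health")
async def health():
    # check_database, check_cache, and check_external_api are assumed to be
    # defined elsewhere and to return True when the dependency is reachable.
    checks = {
        "database": check_database(),
        "cache": check_cache(),
        "external_api": check_external_api(),
    }

    all_healthy = all(checks.values())

    # FastAPI does not interpret a (body, status) tuple as a status code,
    # so set the status explicitly on the response.
    return JSONResponse(
        status_code=200 if all_healthy else 503,
        content={
            "status": "healthy" if all_healthy else "unhealthy",
            "checks": checks,
        },
    )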

Chaos Engineering

Test your reliability by deliberately breaking things.

Chaos Tool | What It Tests
Chaos Monkey | Kill random instances
Litmus | Kubernetes chaos
Gremlin | Targeted attacks
AWS Fault Injection Simulator | Cloud infrastructure

Start small: Test in staging first. Have rollback plans. Start with non-critical services. Gradually move to production with proper monitoring.
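Before reaching for a dedicated tool, the same idea can be tried at the code level in a test or staging environment. The sketch below wraps a function so that a configurable fraction of calls fail on purpose; the names `with_faults` and `InjectedFault` are hypothetical, and this is not how any of the tools above are implemented.

```python
import random


class InjectedFault(Exception):
    """Marks a failure that was injected deliberately by the experiment."""


def with_faults(func, failure_rate: float, rng=random.random):
    """Wrap func so that roughly `failure_rate` of calls fail before reaching it."""
    def wrapper(*args, **kwargs):
        if rng() < failure_rate:
            raise InjectedFault("injected failure")
        return func(*args, **kwargs)
    return wrapper
```

Wrapping a dependency call with a small `failure_rate` makes it possible to check that the retries, circuit breakers, and fallbacks around it behave as expected.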


Reliability Checklist

Pattern | When to Use
Timeouts | Always - prevent infinite waits
Retries | Transient failures, idempotent operations
Circuit Breaker | Calls to external services that might fail
Bulkhead | Isolate critical resources
Fallback | Degrade gracefully when primary fails
Health Checks | Help orchestrators manage your service
Chaos Engineering | Test your reliability assumptions

What to Remember for Interviews

  1. Timeouts are essential: Always set them; infinite waits are never acceptable
  2. Retry with backoff: Exponential backoff + jitter prevents thundering herd
  3. Circuit breakers: Protect against cascading failures
  4. Graceful degradation: Have fallback strategies for every critical path
  5. Test for failure: Use chaos engineering to verify reliability

Practice: For any system design, ask: "What happens if X fails?" Then implement the appropriate reliability pattern.