Observability and Monitoring: Logs, Metrics, Traces, SLOs, and Alerts
Build production observability using structured logs, RED and USE metrics, distributed tracing, OpenTelemetry, dashboards, alerts, SLOs, SLIs, and health probes.
Monitoring vs Observability
Monitoring tells you whether known things are healthy. Observability helps you understand unknown failure modes by inspecting system outputs: logs, metrics, traces, events, and profiles.
Key idea: Dashboards are not observability by themselves. Observability means you can ask new questions about production behavior without shipping new code every time.
The Three Pillars
| Pillar | Best For |
|---|---|
| Logs | Detailed event context |
| Metrics | Trends, alerts, and dashboards |
| Traces | Request flow across services |
Modern observability also includes profiling, events, real-user monitoring, synthetic checks, and business metrics.
Structured Logging
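Structured logs emit each event as machine-parseable key-value fields (commonly JSON) instead of free-form text, so events can be filtered and aggregated by field.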
{
  "level": "error",
  "service": "checkout",
  "traceId": "4bf92f",
  "userId": "u_123",
  "orderId": "ord_456",
  "message": "payment authorization failed",
  "provider": "stripe",
  "latencyMs": 382
}
Logging Rules
- Include correlation IDs.
- Log domain identifiers, not only stack traces.
- Avoid secrets and sensitive data.
- Use consistent fields across services.
- Sample noisy success logs.
- Keep error logs actionable.
Logs can become a data leak: redact tokens, passwords, payment data, and unnecessary personal data before logs leave the service.
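The logging rules above can be sketched with Python's standard `logging` module. This is a minimal illustration, not a production setup: the `SENSITIVE_KEYS` deny-list, the `redact` helper, the `JsonFormatter` class, and the convention of passing fields through `extra={"fields": ...}` are all assumptions made for this example.

```python
import json
import logging

# Hypothetical deny-list; a real service would maintain its own.
SENSITIVE_KEYS = {"password", "token", "authorization", "cardNumber"}


def redact(fields):
    """Return a copy of fields with sensitive values masked."""
    return {
        key: "[REDACTED]" if key in SENSITIVE_KEYS else value
        for key, value in fields.items()
    }


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object per line."""

    def format(self, record):
        payload = {
            "level": record.levelname.lower(),
            "service": "checkout",  # assumed service name
            "message": record.getMessage(),
        }
        # Fields passed via extra={"fields": {...}} become record.fields.
        payload.update(redact(getattr(record, "fields", {})))
        return json.dumps(payload)


logger = logging.getLogger("checkout")
handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger.addHandler(handler)
logger.setLevel(logging.INFO)

# The token value is masked before the line leaves the process.
logger.error(
    "payment authorization failed",
    extra={"fields": {"traceId": "4bf92f", "orderId": "ord_456", "token": "tok_secret"}},
)
```

Redacting inside the formatter means every handler attached to it emits the masked form, which is one way to keep the rule from depending on each call site remembering it.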
Metrics
Metrics are numeric time-series used for dashboards and alerts.
RED Method
For request-driven services:
| Metric | Meaning |
|---|---|
| Rate | Requests per second |
| Errors | Failed request rate |
| Duration | Latency distribution |
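As a rough sketch, the three RED signals can be derived from the raw requests seen in one time window. The `Request` record and `red_summary` function are hypothetical names for this example, and counting only 5xx responses as errors is an assumption; a real service would usually record these through a metrics library rather than compute them from a list.

```python
from dataclasses import dataclass


@dataclass
class Request:
    status: int         # HTTP status code
    duration_ms: float  # time taken to serve the request


def red_summary(requests, window_seconds):
    """Compute Rate, Errors, and Duration for one time window."""
    total = len(requests)
    errors = sum(1 for r in requests if r.status >= 500)
    durations = sorted(r.duration_ms for r in requests)
    # Nearest-rank p99: index of ceil(0.99 * total) - 1, clamped at 0.
    p99_index = max(0, -(-99 * total // 100) - 1)
    return {
        "rate_per_s": total / window_seconds,
        "error_ratio": errors / total if total else 0.0,
        "p99_ms": durations[p99_index] if durations else None,
    }


window = [Request(200, 10.0)] * 8 + [Request(500, 900.0), Request(200, 20.0)]
summary = red_summary(window, window_seconds=5)
```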
USE Method
For resources:
| Metric | Meaning |
|---|---|
| Utilization | How busy the resource is |
| Saturation | How much work is queued |
| Errors | Error events |
Use histograms for latency so you can report percentiles such as p95 and p99. Averages hide tail latency.
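To illustrate why averages mislead, the sketch below records latencies into cumulative buckets and compares the mean with a bucket-based percentile estimate. The bucket bounds and the helper names `observe_all` and `quantile_upper_bound` are assumptions chosen for this example; the cumulative "at or below" layout mirrors how many metrics systems store histograms.

```python
import bisect

# Assumed bucket upper bounds in milliseconds (cumulative buckets).
BUCKETS_MS = [50, 100, 250, 500, 1000, float("inf")]


def observe_all(latencies_ms):
    """Count how many observations fall at or below each bucket bound."""
    counts = [0] * len(BUCKETS_MS)
    for value in latencies_ms:
        first = bisect.bisect_left(BUCKETS_MS, value)
        for i in range(first, len(BUCKETS_MS)):
            counts[i] += 1
    return counts


def quantile_upper_bound(counts, q):
    """Return the smallest bucket bound that covers quantile q."""
    target = q * counts[-1]
    for bound, count in zip(BUCKETS_MS, counts):
        if count >= target:
            return bound
    return BUCKETS_MS[-1]


# 95 fast requests and 5 slow ones.
latencies = [40] * 95 + [900] * 5
mean = sum(latencies) / len(latencies)   # 83.0 ms, looks healthy
counts = observe_all(latencies)
p50 = quantile_upper_bound(counts, 0.50)  # within the 50 ms bucket
p99 = quantile_upper_bound(counts, 0.99)  # only covered by the 1000 ms bucket
```

The mean of 83 ms suggests nothing is wrong, while the p99 estimate shows that the slowest requests land in the 1000 ms bucket.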
Distributed Tracing
Tracing follows one request across services.
| Trace Concept | Meaning |
|---|---|
| Trace | End-to-end request journey |
| Span | One operation inside the trace |
| Trace ID | Shared identifier across spans |
| Span attributes | Service, route, DB query, status |
OpenTelemetry is the vendor-neutral standard for instrumenting code and exporting traces, metrics, and logs.
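To make the trace and span IDs concrete, here is a hand-rolled sketch of propagating a `traceparent` header in the format defined by the W3C Trace Context specification (`version-traceid-spanid-flags`). It is not the OpenTelemetry API; in practice an OpenTelemetry SDK would create and propagate this context for you. The function names are hypothetical.

```python
import secrets


def new_traceparent():
    """Start a trace: 16-byte trace ID, 8-byte span ID, sampled flag set."""
    trace_id = secrets.token_hex(16)  # 32 hex characters
    span_id = secrets.token_hex(8)    # 16 hex characters
    return f"00-{trace_id}-{span_id}-01"


def child_traceparent(incoming):
    """Keep the caller's trace ID but mint a new span ID for this hop."""
    version, trace_id, _parent_span_id, flags = incoming.split("-")
    return f"{version}-{trace_id}-{secrets.token_hex(8)}-{flags}"


def outgoing_headers(incoming_headers):
    """Build headers for a downstream call, continuing the trace if present."""
    incoming = incoming_headers.get("traceparent")
    traceparent = child_traceparent(incoming) if incoming else new_traceparent()
    return {"traceparent": traceparent}


# The edge service starts the trace; the next hop continues it.
edge = outgoing_headers({})
downstream = outgoing_headers(edge)
```

Both headers share the same trace ID, which is what lets a tracing backend stitch the spans from different services into one request journey.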
SLI, SLO, and SLA
| Term | Meaning |
|---|---|
| SLI | Service Level Indicator, a measured signal |
| SLO | Service Level Objective, a target for an SLI |
| SLA | Contractual commitment, often with penalties |
Example:
SLI: percentage of checkout requests that complete in under 500 ms without a 5xx response
SLO: 99.9% of requests meet the SLI over a rolling 30-day window
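The example SLI can be computed as "good requests divided by total requests". The sketch below assumes each request is a `(status, duration_ms)` tuple and that an empty window counts as meeting the objective; both are choices made for this illustration, and `checkout_sli` is a hypothetical name.

```python
def checkout_sli(requests, latency_threshold_ms=500):
    """Fraction of requests that were both fast enough and free of 5xx errors."""
    if not requests:
        return 1.0  # assumption: no traffic counts as meeting the objective
    good = sum(
        1
        for status, duration_ms in requests
        if status < 500 and duration_ms < latency_threshold_ms
    )
    return good / len(requests)


SLO_TARGET = 0.999

# 10,000 requests: 5 server errors and 3 slow responses.
requests = [(200, 120)] * 9992 + [(503, 80)] * 5 + [(200, 750)] * 3
sli = checkout_sli(requests)   # 9992 good out of 10000
slo_met = sli >= SLO_TARGET
```

Note that a slow 200 response counts against the SLI just like a 5xx, because the indicator is defined from the user's point of view.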
Error Budget
If your SLO is 99.9%, your error budget is the remaining 0.1%: the share of requests (or time) in the window that is allowed to fail. Error budgets help balance reliability work and feature velocity.
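As a worked example, a percentage budget can be turned into time by multiplying the window length by one minus the SLO target. The `error_budget` helper is a hypothetical name, and treating the budget as full-outage time is a simplification.

```python
from datetime import timedelta


def error_budget(slo_target, window):
    """Return how long a service may be fully unavailable within the window."""
    return window * (1 - slo_target)


# 0.1% of a 30-day window.
budget = error_budget(0.999, timedelta(days=30))
print(budget)  # 0:43:12
```

So a 99.9% target over 30 days leaves roughly 43 minutes of total outage before the objective is missed.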
Alerting
Good alerts are actionable and tied to user impact.
| Bad Alert | Better Alert |
|---|---|
| CPU > 80% | Checkout p99 latency above SLO for 10 minutes |
| One pod restarted | Error budget burn rate too high |
| Disk 70% full | Disk predicted to fill within 6 hours |
| Any 500 occurred | 5xx rate above threshold by route |
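The "predicted to fill" alert in the table can be approximated with a linear extrapolation from recent usage samples. This is a deliberately simple sketch: `hours_until_full` is a hypothetical helper, the samples are `(hour, used)` pairs in arbitrary units, and real systems often fit a regression over many points instead of using only the first and last.

```python
def hours_until_full(samples, capacity):
    """Linearly extrapolate from (hour, used) samples to capacity.

    Returns None when usage is flat or shrinking.
    """
    (t0, used0), (t1, used1) = samples[0], samples[-1]
    growth_per_hour = (used1 - used0) / (t1 - t0)
    if growth_per_hour <= 0:
        return None
    return (capacity - used1) / growth_per_hour


# Disk went from 70 to 80 units in 2 hours on a 100-unit volume.
remaining = hours_until_full([(0, 70), (2, 80)], capacity=100)
should_alert = remaining is not None and remaining < 6
```

A disk at 80% that is not growing never fires, while one growing at 5 units per hour fires with about 4 hours of warning.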
Multi-Window Burn Rate
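Burn rate is how fast the error budget is being consumed relative to the rate that would use it up exactly at the end of the SLO window. Multi-window burn-rate alerts pair a long window with a short one at more than one threshold, so they catch both fast incidents and slow reliability leaks.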
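A burn rate of 1 means the budget would last exactly the SLO window; higher values exhaust it sooner. The sketch below checks a long and a short window against the same threshold. The function names are hypothetical, and the 14.4 and 6.0 thresholds follow the style popularized by the Google SRE Workbook; treat them as assumptions to tune, not fixed rules.

```python
def burn_rate(error_ratio, slo_target):
    """How many times faster than sustainable the budget is being spent."""
    return error_ratio / (1 - slo_target)


def should_page(long_window_errors, short_window_errors, slo_target, threshold):
    """Fire only when both windows exceed the threshold.

    The long window shows the burn is significant; the short window shows
    it is still happening, so the alert clears soon after recovery.
    """
    return (
        burn_rate(long_window_errors, slo_target) >= threshold
        and burn_rate(short_window_errors, slo_target) >= threshold
    )


SLO = 0.999
FAST_BURN = 14.4  # assumed pairing: 1 h long window, 5 min short window
SLOW_BURN = 6.0   # assumed pairing: 6 h long window, 30 min short window

fast_incident = should_page(0.02, 0.03, SLO, FAST_BURN)   # about 20x and 30x
slow_leak = should_page(0.007, 0.008, SLO, SLOW_BURN)     # about 7x and 8x
recovered = should_page(0.02, 0.0005, SLO, FAST_BURN)     # short window is calm
```

In the third case the long window still looks bad, but the short window shows the problem has stopped, so no page is sent.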
Health Checks
| Probe | Purpose |
|---|---|
| Liveness | Should the process be restarted? |
| Readiness | Can this instance receive traffic? |
| Startup | Has slow initialization completed? |
Readiness should fail when critical dependencies are unavailable. Liveness should not fail just because a dependency is slow.
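The difference between the two probes can be sketched as plain functions that return an HTTP status code and a body. The names `liveness` and `readiness` and the shape of `dependency_checks` are assumptions for this example; wiring them to routes such as `/healthz` is left to whatever web framework the service uses.

```python
def liveness():
    """The process is up and able to answer; dependencies are not consulted."""
    return 200, {"status": "alive"}


def readiness(dependency_checks):
    """Ready only when every critical dependency check passes.

    dependency_checks maps a name to a zero-argument callable returning bool.
    """
    failed = []
    for name, check in dependency_checks.items():
        try:
            ok = check()
        except Exception:
            ok = False  # a crashing check counts as an unavailable dependency
        if not ok:
            failed.append(name)
    if failed:
        return 503, {"status": "not ready", "failed": failed}
    return 200, {"status": "ready"}


# Hypothetical checks: the database answers, the payment provider does not.
status, body = readiness({"database": lambda: True, "payments": lambda: False})
```

Here the instance is taken out of rotation (503 from readiness) but is not restarted, because liveness still returns 200.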
Dashboards
Create dashboards by user journey and service ownership.
| Dashboard | Should Show |
|---|---|
| Executive/service health | SLO, error budget, major dependencies |
| API service | RED metrics by route and status |
| Database | connections, slow queries, locks, replication lag |
| Queue | depth, age, processing rate, failures |
| Incident | recent deploys, errors, logs, traces |
Dashboards should answer operational questions quickly during incidents.
What to Remember for Interviews
- Logs, metrics, and traces answer different questions.
- Use RED for services and USE for resources.
- Trace IDs connect distributed work: propagate them through every service call.
- SLOs should reflect user experience, not just infrastructure health.
- Alerts should be actionable: page on symptoms, investigate causes with dashboards.
Practice: Design observability for a checkout system. Include SLIs, SLOs, tracing, dashboards, alerts, health probes, and what data you would log during a payment failure.