Observability and Monitoring: Logs, Metrics, Traces, SLOs, and Alerts

Build production observability using structured logs, RED and USE metrics, distributed tracing, OpenTelemetry, dashboards, alerts, SLOs, SLIs, and health probes.

Tags: observability, monitoring, tracing, SLO, OpenTelemetry

Monitoring vs Observability

Monitoring tells you whether known things are healthy. Observability helps you understand unknown failure modes by inspecting system outputs: logs, metrics, traces, events, and profiles.

✅

Key idea: Dashboards are not observability by themselves. Observability means you can ask new questions about production behavior without shipping new code every time.

The Three Pillars

Pillar	Best For
Logs	Detailed event context
Metrics	Trends, alerts, and dashboards
Traces	Request flow across services

Modern observability also includes profiling, events, real-user monitoring, synthetic checks, and business metrics.

Structured Logging

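Structured logs record each event as a set of named fields (typically one JSON object per line) instead of free-form text, so they can be filtered, joined, and aggregated by field.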

json

{
  "level": "error",
  "service": "checkout",
  "traceId": "4bf92f",
  "userId": "u_123",
  "orderId": "ord_456",
  "message": "payment authorization failed",
  "provider": "stripe",
  "latencyMs": 382
}
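A minimal sketch of emitting logs in this shape with Python's standard `logging` module. The `JsonFormatter` class, the `build_logger` helper, and the convention of passing fields under `extra={"fields": ...}` are assumptions made for illustration, not the API of any particular logging library.

```python
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON object."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "level": record.levelname.lower(),
            "service": "checkout",  # assumed service name, taken from the example
            "message": record.getMessage(),
        }
        # Hypothetical convention: callers pass extra={"fields": {...}}.
        payload.update(getattr(record, "fields", {}))
        return json.dumps(payload)


def build_logger(stream=None) -> logging.Logger:
    """Return a logger that writes JSON lines to `stream` (stderr by default)."""
    logger = logging.getLogger("checkout")
    logger.handlers.clear()
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger


if __name__ == "__main__":
    log = build_logger()
    log.error(
        "payment authorization failed",
        extra={"fields": {"traceId": "4bf92f", "orderId": "ord_456",
                          "provider": "stripe", "latencyMs": 382}},
    )
```

Keeping the field names in one formatter makes it easier to stay consistent across services.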

Logging Rules

Include correlation IDs.
Log domain identifiers, not only stack traces.
Avoid secrets and sensitive data.
Use consistent fields across services.
Sample noisy success logs.
Keep error logs actionable.

⚠️

Logs can become a data leak: Redact tokens, passwords, payment data, and unnecessary personal data before logs leave the service.
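One way to apply this rule is to pass every event through a redaction step before it is serialized. The sketch below is illustrative only: the `SENSITIVE_KEYS` set and the crude long-digit-run pattern are assumptions, and a real service would need a list reviewed against its own data.

```python
import re

# Assumed list of field names that must never be logged in clear text.
SENSITIVE_KEYS = {"password", "token", "authorization", "cardnumber", "cvv"}

# Crude assumption: any run of 13 to 19 digits might be a card number.
LONG_DIGIT_RUN = re.compile(r"\b\d{13,19}\b")


def redact(event: dict) -> dict:
    """Return a copy of `event` with sensitive values masked."""
    clean = {}
    for key, value in event.items():
        if key.lower() in SENSITIVE_KEYS:
            clean[key] = "[REDACTED]"
        elif isinstance(value, dict):
            clean[key] = redact(value)  # recurse into nested objects
        elif isinstance(value, str):
            clean[key] = LONG_DIGIT_RUN.sub("[REDACTED]", value)
        else:
            clean[key] = value
    return clean
```

Redacting by key name catches known fields, while the pattern is a backstop for values that leak into free-text messages.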

Metrics

Metrics are numeric time-series used for dashboards and alerts.

RED Method

For request-driven services:

Metric	Meaning
Rate	Requests per second
Errors	Failed request rate
Duration	Latency distribution
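As a rough sketch, the three RED signals can be derived from a window of request records. The `Request` type and `red_summary` function are hypothetical names, and treating only 5xx responses as errors is an assumption; a metrics library would normally do this aggregation for you.

```python
from dataclasses import dataclass
from statistics import median


@dataclass
class Request:
    status: int
    duration_ms: float


def red_summary(requests: list, window_seconds: float) -> dict:
    """Summarize Rate, Errors, and Duration for one time window."""
    total = len(requests)
    errors = sum(1 for r in requests if r.status >= 500)  # assumed error rule
    durations = [r.duration_ms for r in requests]
    return {
        "rate_per_s": total / window_seconds,
        "error_ratio": errors / total if total else 0.0,
        "median_ms": median(durations) if durations else 0.0,
        "max_ms": max(durations) if durations else 0.0,
    }
```

In practice these would be broken down by route and status, as the API service dashboard later in this article suggests.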

USE Method

For resources:

Metric	Meaning
Utilization	How busy the resource is
Saturation	How much work is queued
Errors	Error events
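For a concrete picture, here is a small sketch that applies USE to a worker pool. The `PoolSnapshot` type and its fields are invented for illustration; the same three questions apply to CPUs, disks, or connection pools.

```python
from dataclasses import dataclass


@dataclass
class PoolSnapshot:
    busy_workers: int
    total_workers: int
    queued_tasks: int
    failed_tasks: int


def use_summary(snapshot: PoolSnapshot) -> dict:
    """Map one snapshot of a worker pool onto Utilization, Saturation, Errors."""
    return {
        # Utilization: fraction of capacity currently doing work.
        "utilization": snapshot.busy_workers / snapshot.total_workers,
        # Saturation: work that is waiting because capacity is exhausted.
        "saturation": snapshot.queued_tasks,
        # Errors: failed work items in the observed window.
        "errors": snapshot.failed_tasks,
    }
```

A pool at full utilization with an empty queue is busy but coping; a growing queue is the signal that it is falling behind.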

Use histograms for latency. Averages hide tail latency.
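The following sketch shows why, using a hand-rolled bucketed histogram. The bucket bounds and the helper names are assumptions, and the quantile is reported as the upper bound of the bucket that contains it, which is a coarse estimate rather than an exact percentile.

```python
import bisect
import math

# Assumed bucket upper bounds in milliseconds; the final slot is overflow.
BUCKET_BOUNDS_MS = [50, 100, 250, 500, 1000, 2500, 5000]


def build_histogram(durations_ms) -> list:
    """Count observations per bucket (value <= bound), plus an overflow slot."""
    counts = [0] * (len(BUCKET_BOUNDS_MS) + 1)
    for d in durations_ms:
        counts[bisect.bisect_left(BUCKET_BOUNDS_MS, d)] += 1
    return counts


def quantile_upper_bound(counts: list, q: float) -> float:
    """Return the upper bound of the bucket holding the q-th quantile."""
    total = sum(counts)
    if total == 0:
        raise ValueError("histogram is empty")
    rank = max(1, math.ceil(q * total))
    seen = 0
    for i, count in enumerate(counts):
        seen += count
        if seen >= rank:
            return BUCKET_BOUNDS_MS[i] if i < len(BUCKET_BOUNDS_MS) else math.inf
    return math.inf


if __name__ == "__main__":
    # 98 fast requests and 2 very slow ones.
    durations = [50] * 98 + [5000] * 2
    hist = build_histogram(durations)
    print("mean:", sum(durations) / len(durations))
    print("p50 bucket:", quantile_upper_bound(hist, 0.50))
    print("p99 bucket:", quantile_upper_bound(hist, 0.99))
```

With this sample the mean sits at 149 ms, which looks healthy, while the p99 lands in the 5000 ms bucket.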

Distributed Tracing

Tracing follows one request across services.

Trace Concept	Meaning
Trace	End-to-end request journey
Span	One operation inside the trace
Trace ID	Shared identifier across spans
Span attributes	Service, route, DB query, status

OpenTelemetry is a vendor-neutral, widely adopted standard for instrumenting code and for collecting and exporting traces, metrics, and logs.
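To illustrate how a trace ID is shared across spans, the sketch below hand-builds headers in the W3C Trace Context `traceparent` format (`version-traceid-spanid-flags`). It deliberately does not use the OpenTelemetry SDK, which handles propagation for you; the function names are hypothetical and validation is simplified.

```python
import re
import secrets
from typing import Optional

# Version 00: 32 hex chars of trace ID, 16 of span ID, 2 of flags.
TRACEPARENT_RE = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def new_traceparent() -> str:
    """Start a new trace with a random trace ID and span ID (sampled flag set)."""
    return f"00-{secrets.token_hex(16)}-{secrets.token_hex(8)}-01"


def child_traceparent(incoming: Optional[str]) -> str:
    """Build the header for an outgoing call made while handling `incoming`.

    The trace ID is kept so all spans join the same trace; only the
    span ID changes. A missing or malformed header starts a new trace.
    """
    match = TRACEPARENT_RE.match(incoming or "")
    if match is None:
        return new_traceparent()
    trace_id, _parent_span_id, flags = match.groups()
    return f"00-{trace_id}-{secrets.token_hex(8)}-{flags}"
```

Each service would call something like `child_traceparent` on the header it received and attach the result to every downstream request, and would also log the trace ID so logs and traces can be joined.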

SLI, SLO, and SLA

Term	Meaning
SLI	Service Level Indicator, a measured signal
SLO	Service Level Objective, a target for an SLI
SLA	Contractual commitment, often with penalties

Example:

txt

SLI: percentage of checkout requests that complete in under 500 ms without a 5xx response
SLO: 99.9% over a rolling 30-day window
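A sketch of how this SLI could be computed from raw request records and compared against the target. The record shape (status code, duration in milliseconds) and the function names are assumptions; in production the counts would usually come from a metrics backend.

```python
SLO_TARGET = 0.999          # 99.9%, from the example
LATENCY_THRESHOLD_MS = 500  # from the example


def checkout_sli(requests) -> float:
    """Fraction of requests that were both fast enough and not a 5xx.

    `requests` is an iterable of (status_code, duration_ms) pairs.
    """
    total = 0
    good = 0
    for status, duration_ms in requests:
        total += 1
        if status < 500 and duration_ms < LATENCY_THRESHOLD_MS:
            good += 1
    # With no traffic there is nothing to violate the objective.
    return good / total if total else 1.0


def meets_slo(requests) -> bool:
    return checkout_sli(requests) >= SLO_TARGET
```

Note that a slow 200 and a fast 500 both count as bad events, because either one is a poor experience for the user.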

Error Budget

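If your SLO is 99.9%, your error budget is the remaining 0.1%, which is the share of requests (or time) in the SLO window that is allowed to be bad. Error budgets help balance reliability work and feature velocity: while budget remains there is room to ship changes, and once it is spent reliability work takes priority.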
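The arithmetic can be sketched as follows. The function names are hypothetical, and the 30-day window is carried over from the SLO example above.

```python
def error_budget_fraction(slo_target: float) -> float:
    """Share of events allowed to be bad, e.g. roughly 0.001 for a 99.9% SLO."""
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be between 0 and 1, exclusive")
    return 1.0 - slo_target


def allowed_bad_minutes(slo_target: float, window_days: int = 30) -> float:
    """Budget expressed as minutes of full outage within the window."""
    return error_budget_fraction(slo_target) * window_days * 24 * 60


def budget_remaining(slo_target: float, total_requests: int, bad_requests: int) -> float:
    """Fraction of the budget still unspent (negative once it is overspent)."""
    allowed = error_budget_fraction(slo_target) * total_requests
    return 1.0 - bad_requests / allowed
```

For a 99.9% SLO over 30 days this works out to roughly 43.2 minutes of total unavailability, or about 1,000 bad requests per million.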

Alerting

Good alerts are actionable and tied to user impact.

Bad Alert	Better Alert
CPU > 80%	Checkout p99 latency above SLO for 10 minutes
One pod restarted	Error budget burn rate too high
Disk 70% full	Disk predicted to fill within 6 hours
Any 500 occurred	5xx rate above threshold by route
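The disk example in the table swaps a static threshold for a prediction. A minimal sketch of that idea, assuming simple linear extrapolation between the oldest and newest sample (the function names and the 6-hour default are illustrative; real systems often fit a regression over more points):

```python
from typing import Optional


def hours_until_full(samples, capacity_bytes: float) -> Optional[float]:
    """Estimate hours until the disk fills, or None if usage is not growing.

    `samples` is a list of (hours_elapsed, used_bytes) pairs, oldest first.
    """
    (t0, used0), (t1, used1) = samples[0], samples[-1]
    if t1 <= t0:
        raise ValueError("need at least two samples at increasing times")
    growth_per_hour = (used1 - used0) / (t1 - t0)
    if growth_per_hour <= 0:
        return None
    return (capacity_bytes - used1) / growth_per_hour


def should_alert(samples, capacity_bytes: float, horizon_hours: float = 6.0) -> bool:
    """Alert only when the disk is projected to fill within the horizon."""
    eta = hours_until_full(samples, capacity_bytes)
    return eta is not None and eta <= horizon_hours
```

A disk that is 70% full but flat never fires, while one that is 50% full and growing quickly does.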

Multi-Window Burn Rate

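Burn rate measures how fast the error budget is being spent: a burn rate of 1 uses exactly the whole budget by the end of the SLO window. Alerting at several burn-rate thresholds, each checked over a long window and a shorter confirmation window, catches both fast incidents and slow reliability leaks, and stops firing soon after the problem ends.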
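A sketch of one such alert condition. Here burn rate is taken as the observed error ratio divided by the budget (1 minus the SLO). The 14.4 threshold is a commonly cited value for a 1-hour window paired with a 5-minute window on a 30-day SLO (it corresponds to spending 2% of the budget in one hour); it is an assumption here, not something this article prescribes.

```python
def burn_rate(error_ratio: float, slo_target: float) -> float:
    """How many times faster than 'exactly on budget' errors are occurring."""
    return error_ratio / (1.0 - slo_target)


def should_page(long_window_error_ratio: float,
                short_window_error_ratio: float,
                slo_target: float = 0.999,
                threshold: float = 14.4) -> bool:
    """Page only if both windows burn faster than the threshold.

    The long window shows the burn is significant; the short window
    confirms it is still happening right now.
    """
    return (burn_rate(long_window_error_ratio, slo_target) >= threshold
            and burn_rate(short_window_error_ratio, slo_target) >= threshold)
```

A second, lower threshold over longer windows (for example, hours to days) would cover the slow-leak case, typically routed to a ticket rather than a page.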

Health Checks

Probe	Purpose
Liveness	Should the process be restarted?
Readiness	Can this instance receive traffic?
Startup	Has slow initialization completed?

Readiness should fail when critical dependencies are unavailable. Liveness should not fail just because a dependency is slow.
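The distinction can be sketched as three small functions that return HTTP status codes. The `AppState` fields are hypothetical stand-ins for whatever checks a real service performs, and the framework wiring is omitted.

```python
from dataclasses import dataclass


@dataclass
class AppState:
    started: bool = False                # slow initialization finished
    event_loop_responsive: bool = True   # process itself is not wedged
    database_reachable: bool = True      # a critical dependency


def startup_probe(state: AppState) -> int:
    """200 once initialization has completed."""
    return 200 if state.started else 503


def liveness_probe(state: AppState) -> int:
    """Checks only the process; dependency trouble must not trigger a restart."""
    return 200 if state.event_loop_responsive else 503


def readiness_probe(state: AppState) -> int:
    """Fails when the instance cannot usefully serve traffic."""
    return 200 if state.started and state.database_reachable else 503
```

With the database down, readiness returns 503 so traffic is routed elsewhere, while liveness stays at 200 because restarting the process would not fix the dependency.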

Dashboards

Create dashboards by user journey and service ownership.

Dashboard	Should Show
Executive/service health	SLO, error budget, major dependencies
API service	RED metrics by route and status
Database	connections, slow queries, locks, replication lag
Queue	depth, age, processing rate, failures
Incident	current deploys, errors, logs, traces

Dashboards should answer operational questions quickly during incidents.

What to Remember for Interviews

Logs, metrics, and traces answer different questions.
Use RED for services and USE for resources.
Trace IDs connect distributed work: propagate them through every service call.
SLOs should reflect user experience, not just infrastructure health.
Alerts should be actionable: page on symptoms, investigate causes with dashboards.

✅

Practice: Design observability for a checkout system. Include SLIs, SLOs, tracing, dashboards, alerts, health probes, and what data you would log during a payment failure.
