Observability and Monitoring: Logs, Metrics, Traces, SLOs, and Alerts

Build production observability using structured logs, RED and USE metrics, distributed tracing, OpenTelemetry, dashboards, alerts, SLOs, SLIs, and health probes.

Tags: observability, monitoring, tracing, SLO, OpenTelemetry

Monitoring vs Observability

Monitoring tells you whether known things are healthy. Observability helps you understand unknown failure modes by inspecting system outputs: logs, metrics, traces, events, and profiles.

Key idea: Dashboards are not observability by themselves. Observability means you can ask new questions about production behavior without shipping new code every time.


The Three Pillars

| Pillar | Best For |
|---|---|
| Logs | Detailed event context |
| Metrics | Trends, alerts, and dashboards |
| Traces | Request flow across services |

Modern observability also includes profiling, events, real-user monitoring, synthetic checks, and business metrics.


Structured Logging

Structured logs use fields instead of plain text.

```json
{
  "level": "error",
  "service": "checkout",
  "traceId": "4bf92f",
  "userId": "u_123",
  "orderId": "ord_456",
  "message": "payment authorization failed",
  "provider": "stripe",
  "latencyMs": 382
}
```
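One way to produce log lines shaped like the example above is a custom formatter on Python's standard `logging` module. This is a minimal sketch, not a recommended library setup: the `JsonFormatter` class, the `build_logger` helper, the `fields` key passed through `extra`, and the hard-coded `checkout` service name are all assumptions made for illustration.

```python
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON object."""

    def format(self, record: logging.LogRecord) -> str:
        entry = {
            "level": record.levelname.lower(),
            "service": "checkout",  # hypothetical service name
            "message": record.getMessage(),
        }
        # Structured fields passed via extra={"fields": {...}} are merged in.
        entry.update(getattr(record, "fields", {}))
        return json.dumps(entry)


def build_logger(name: str = "checkout") -> logging.Logger:
    """Create a logger whose only handler writes JSON lines to stderr."""
    logger = logging.getLogger(name)
    handler = logging.StreamHandler()
    handler.setFormatter(JsonFormatter())
    logger.handlers = [handler]
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger


logger = build_logger()
logger.error(
    "payment authorization failed",
    extra={"fields": {"traceId": "4bf92f", "orderId": "ord_456", "latencyMs": 382}},
)
```

Because every field is a JSON key, a log backend can filter on `orderId` or `traceId` directly instead of parsing free text.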

Logging Rules

  1. Include correlation IDs.
  2. Log domain identifiers, not only stack traces.
  3. Avoid secrets and sensitive data.
  4. Use consistent fields across services.
  5. Sample noisy success logs.
  6. Keep error logs actionable.

⚠️ Logs can become a data leak: redact tokens, passwords, payment data, and unnecessary personal data before logs leave the service.
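Redaction can be applied to the structured fields just before they are serialized. The sketch below assumes a simple deny-list of key names; `SENSITIVE_KEYS` and `redact` are hypothetical names, and a real service would likely need a longer list plus pattern-based checks on values.

```python
from typing import Any

# Hypothetical deny-list of field names; extend it for your own domain.
SENSITIVE_KEYS = {"password", "token", "authorization", "cardnumber", "cvv"}


def redact(fields: dict[str, Any]) -> dict[str, Any]:
    """Return a copy of `fields` with sensitive values masked.

    Key matching is case-insensitive and nested dicts are handled recursively.
    """
    clean: dict[str, Any] = {}
    for key, value in fields.items():
        if key.lower() in SENSITIVE_KEYS:
            clean[key] = "[REDACTED]"
        elif isinstance(value, dict):
            clean[key] = redact(value)
        else:
            clean[key] = value
    return clean


safe = redact({"orderId": "ord_456", "cardNumber": "4242", "auth": {"token": "abc"}})
```

Returning a copy rather than mutating the input keeps the original data usable by the request handler while only the logged view is masked.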


Metrics

Metrics are numeric time-series used for dashboards and alerts.

RED Method

For request-driven services:

| Metric | Meaning |
|---|---|
| Rate | Requests per second |
| Errors | Rate of failed requests |
| Duration | Latency distribution |
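To make the three RED signals concrete, the following sketch derives them from a batch of request records collected over one window. The `Request` type, the `red_summary` function, and the choice to count only 5xx responses as errors are assumptions for illustration; a production service would normally emit these through a metrics library rather than compute them by hand.

```python
import statistics
from dataclasses import dataclass


@dataclass
class Request:
    status: int         # HTTP status code
    duration_ms: float  # time taken to serve the request


def red_summary(requests: list[Request], window_seconds: float) -> dict[str, float]:
    """Summarize Rate, Errors, and Duration for one observation window."""
    if not requests:
        return {"rate_per_s": 0.0, "error_ratio": 0.0, "median_ms": 0.0, "max_ms": 0.0}
    durations = [r.duration_ms for r in requests]
    errors = sum(1 for r in requests if r.status >= 500)  # assumption: 5xx only
    return {
        "rate_per_s": len(requests) / window_seconds,
        "error_ratio": errors / len(requests),
        "median_ms": statistics.median(durations),
        "max_ms": max(durations),
    }
```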

USE Method

For resources:

| Metric | Meaning |
|---|---|
| Utilization | How busy the resource is |
| Saturation | How much work is queued |
| Errors | Error events |

Use histograms for latency. Averages hide tail latency.
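A small synthetic data set shows why the average misleads. In this sketch, 98 requests are fast and 2 are very slow; the numbers, the bucket bounds in `BUCKETS_MS`, and the helper names are made up for illustration, and `percentile` uses the nearest-rank definition.

```python
import math
import statistics
from bisect import bisect_left


def percentile(samples: list[float], p: float) -> float:
    """Nearest-rank percentile: the smallest sample with at least p% at or below it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


def bucket_counts(samples: list[float], bounds: list[float]) -> list[int]:
    """Count samples per histogram bucket; the final slot holds values above every bound."""
    counts = [0] * (len(bounds) + 1)
    for sample in samples:
        counts[bisect_left(bounds, sample)] += 1
    return counts


BUCKETS_MS = [50, 100, 250, 500, 1000]          # hypothetical upper bounds
latencies_ms = [20.0] * 98 + [3000.0] * 2       # synthetic data

mean_ms = statistics.mean(latencies_ms)         # 79.6
p50_ms = percentile(latencies_ms, 50)           # 20.0
p99_ms = percentile(latencies_ms, 99)           # 3000.0
histogram = bucket_counts(latencies_ms, BUCKETS_MS)  # [98, 0, 0, 0, 0, 2]
```

The mean of 79.6 ms describes no real request, while the histogram and the p99 both expose the two slow requests that users actually experienced.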


Distributed Tracing

Tracing follows one request across services.

| Trace Concept | Meaning |
|---|---|
| Trace | End-to-end request journey |
| Span | One operation inside the trace |
| Trace ID | Shared identifier across spans |
| Span attributes | Service, route, DB query, status |

OpenTelemetry is the common vendor-neutral standard for instrumenting code and exporting traces, metrics, and logs.
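The shared trace ID only works if every service forwards it. The sketch below illustrates the idea using the W3C Trace Context `traceparent` header (`version-traceid-parentid-flags`) with the standard library only. In practice an OpenTelemetry SDK would handle propagation; the `outgoing_headers` function and its helpers are hypothetical names, not part of any library.

```python
import re
import secrets

# traceparent: 2-hex version, 32-hex trace ID, 16-hex parent span ID, 2-hex flags.
TRACEPARENT_RE = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def new_trace_id() -> str:
    return secrets.token_hex(16)  # 32 hex characters


def new_span_id() -> str:
    return secrets.token_hex(8)   # 16 hex characters


def outgoing_headers(incoming: dict[str, str]) -> dict[str, str]:
    """Continue the caller's trace if one is present, otherwise start a new one."""
    match = TRACEPARENT_RE.match(incoming.get("traceparent", ""))
    if match:
        trace_id, flags = match.group(1), match.group(3)
    else:
        trace_id, flags = new_trace_id(), "01"  # "01" marks the trace as sampled
    # The trace ID is preserved; each hop contributes its own span ID.
    return {"traceparent": f"00-{trace_id}-{new_span_id()}-{flags}"}
```

A service would call `outgoing_headers(request_headers)` before each downstream call and attach the result, so that spans from every hop share one trace ID.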


SLI, SLO, and SLA

| Term | Meaning |
|---|---|
| SLI | Service Level Indicator, a measured signal |
| SLO | Service Level Objective, a target for an SLI |
| SLA | Contractual commitment, often with penalties |

Example:

```txt
SLI: percentage of checkout requests completed under 500 ms and without 5xx
SLO: 99.9% over a rolling 30-day window
```
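The example SLI can be computed as the share of "good" requests in a window. This is a minimal sketch: the `(status, duration_ms)` tuple shape, the `checkout_sli` name, and the decision to treat an empty window as fully compliant are assumptions.

```python
def checkout_sli(requests: list[tuple[int, float]], threshold_ms: float = 500.0) -> float:
    """Fraction of requests that were both under the latency threshold and not a 5xx."""
    if not requests:
        return 1.0  # assumption: no traffic counts as meeting the objective
    good = sum(1 for status, ms in requests if status < 500 and ms < threshold_ms)
    return good / len(requests)


SLO_TARGET = 0.999

# (status, duration_ms) pairs; synthetic data.
window = [(200, 120.0), (200, 650.0), (503, 90.0), (201, 300.0)]
sli = checkout_sli(window)        # 0.5
slo_met = sli >= SLO_TARGET       # False
```

Note that a request counts as bad if it fails either condition: the slow 200 and the fast 503 both reduce the SLI.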

Error Budget

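If your SLO is 99.9%, your error budget is the remaining 0.1%: 1 failed request in every 1,000, or about 43 minutes of full downtime over a 30-day window. Error budgets help balance reliability work and feature velocity: while budget remains, teams can keep shipping; once it is spent, reliability work takes priority.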
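The budget arithmetic is simple enough to sketch directly. The function names below are hypothetical, and the downtime figure assumes the budget is spent as total outage rather than partial failure.

```python
def error_budget(slo: float, window_days: int = 30) -> dict[str, float]:
    """Translate an SLO into an allowed failure ratio and equivalent full-downtime minutes."""
    budget_ratio = 1.0 - slo
    window_minutes = window_days * 24 * 60
    return {
        "budget_ratio": budget_ratio,
        "downtime_minutes": budget_ratio * window_minutes,
    }


def budget_remaining(slo: float, total_requests: int, failed_requests: int) -> float:
    """Fraction of the error budget still unspent (negative once the budget is blown)."""
    allowed_failures = (1.0 - slo) * total_requests
    return 1.0 - failed_requests / allowed_failures


budget = error_budget(0.999)                         # about 43.2 minutes over 30 days
remaining = budget_remaining(0.999, 1_000_000, 250)  # about 0.75
```

With one million requests and a 99.9% SLO, roughly 1,000 failures are allowed, so 250 failures leave about three quarters of the budget.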


Alerting

Good alerts are actionable and tied to user impact.

| Bad Alert | Better Alert |
|---|---|
| CPU > 80% | Checkout p99 latency above SLO for 10 minutes |
| One pod restarted | Error budget burn rate too high |
| Disk 70% full | Disk predicted to fill within 6 hours |
| Any 500 occurred | 5xx rate above threshold by route |
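The disk row shows the difference between a static threshold and a prediction. A rough sketch of the predictive version, assuming linear growth between two samples (real systems usually fit a trend over many samples), might look like this; `hours_until_full` and `should_alert` are hypothetical names.

```python
from typing import Optional


def hours_until_full(
    used_then_gb: float, used_now_gb: float, hours_between: float, capacity_gb: float
) -> Optional[float]:
    """Linearly extrapolate disk usage; return None if usage is flat or shrinking."""
    growth_per_hour = (used_now_gb - used_then_gb) / hours_between
    if growth_per_hour <= 0:
        return None
    return (capacity_gb - used_now_gb) / growth_per_hour


def should_alert(hours_left: Optional[float], horizon_hours: float = 6.0) -> bool:
    """Alert only when the disk is predicted to fill within the horizon."""
    return hours_left is not None and hours_left < horizon_hours


# 70% full but growing slowly: no alert. 70% full and growing fast: alert.
slow = hours_until_full(69.0, 70.0, 2.0, 100.0)   # 60.0 hours left
fast = hours_until_full(50.0, 70.0, 2.0, 100.0)   # 3.0 hours left
```

Both disks are 70% full, yet only the second one needs attention, which is why the prediction is the more actionable signal.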

Multi-Window Burn Rate

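Burn rate is how fast a service consumes its error budget relative to the pace that would spend it exactly at the end of the SLO window; a burn rate of 1 uses the budget precisely on schedule. Alerting at several burn-rate thresholds, each checked over both a long and a short window, catches fast incidents and slow reliability leaks, and lets alerts clear soon after the problem stops.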
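A minimal sketch of the check follows. The default threshold of 14.4, which corresponds to spending 2% of a 30-day budget in one hour, is a commonly cited paging value from Google's SRE Workbook, typically paired with 1-hour and 5-minute windows; treat it and the function names as assumptions to tune for your own service.

```python
def burn_rate(error_ratio: float, slo: float) -> float:
    """How many times faster than 'exactly on budget' errors are occurring."""
    return error_ratio / (1.0 - slo)


def should_page(
    long_window_error_ratio: float,
    short_window_error_ratio: float,
    slo: float,
    threshold: float = 14.4,  # assumed paging threshold, see lead-in
) -> bool:
    """Page only if both windows burn fast.

    The long window shows the burn is significant; the short window shows
    it is still happening right now.
    """
    return (
        burn_rate(long_window_error_ratio, slo) >= threshold
        and burn_rate(short_window_error_ratio, slo) >= threshold
    )


SLO = 0.999
ongoing = should_page(0.02, 0.02, SLO)        # True: 2% errors in both windows
recovered = should_page(0.02, 0.0005, SLO)    # False: short window is healthy again
blip = should_page(0.005, 0.05, SLO)          # False: long window burn is below threshold
```

A slower leak can be caught by the same function with a lower threshold and longer windows, routed to a ticket instead of a page.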


Health Checks

| Probe | Purpose |
|---|---|
| Liveness | Should the process be restarted? |
| Readiness | Can this instance receive traffic? |
| Startup | Has slow initialization completed? |

Readiness should fail when critical dependencies are unavailable. Liveness should not fail just because a dependency is slow.
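The distinction can be expressed as three small predicates over the same service state. This is a sketch only: `ServiceState` and its fields are hypothetical, and a real service would expose these results on HTTP endpoints that the orchestrator polls.

```python
from dataclasses import dataclass


@dataclass
class ServiceState:
    process_responsive: bool   # the process can still make progress
    startup_complete: bool     # caches warmed, migrations applied, and so on
    database_reachable: bool   # a critical dependency


def liveness(state: ServiceState) -> bool:
    """Fail only when a restart would help; dependencies are deliberately ignored."""
    return state.process_responsive


def readiness(state: ServiceState) -> bool:
    """Fail when the instance cannot serve traffic correctly right now."""
    return state.startup_complete and state.database_reachable


def startup(state: ServiceState) -> bool:
    """Pass once slow initialization has finished."""
    return state.startup_complete


# Database outage: stop routing traffic here, but do not restart the process.
db_down = ServiceState(process_responsive=True, startup_complete=True, database_reachable=False)
```

If liveness also checked the database, an outage there would trigger restarts across every instance without fixing anything.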


Dashboards

Create dashboards by user journey and service ownership.

| Dashboard | Should Show |
|---|---|
| Executive/service health | SLO, error budget, major dependencies |
| API service | RED metrics by route and status |
| Database | Connections, slow queries, locks, replication lag |
| Queue | Depth, age, processing rate, failures |
| Incident | Current deploys, errors, logs, traces |

Dashboards should answer operational questions quickly during incidents.


What to Remember for Interviews

  1. Logs, metrics, and traces answer different questions.
  2. Use RED for services and USE for resources.
  3. Trace IDs connect distributed work: propagate them through every service call.
  4. SLOs should reflect user experience, not just infrastructure health.
  5. Alerts should be actionable: page on symptoms, investigate causes with dashboards.

Practice: Design observability for a checkout system. Include SLIs, SLOs, tracing, dashboards, alerts, health probes, and what data you would log during a payment failure.