Basic System Concepts: Latency, Throughput, Scaling, Load Balancing, and Caching

Understand the foundational metrics and patterns of distributed systems: latency numbers, vertical vs horizontal scaling, load balancing algorithms, and caching strategies.


The Blueprint of System Design

In system design, every decision is a trade-off. To build systems that scale to millions of users, you must understand the fundamental forces that govern distributed architecture. These aren't just academic concepts—they are the "blueprint" used by companies like Google, Meta, and Amazon to build global-scale infrastructure.

1. The Performance Dial: Latency vs. Throughput

Engineers often use "speed" as a catch-all term, but in distributed systems, speed has two distinct dimensions:

Latency: The time it takes for a single request to complete (measured in ms, µs, ns). It's the "delay."
Throughput: The number of requests a system can handle per unit of time (measured in req/s, QPS, or MB/s). It's the "capacity."

💡

The Highway Analogy: Imagine a highway.

Latency is how long it takes a single car to travel from Point A to Point B.
Throughput is how many cars pass a specific point on the highway in one hour.
You can increase throughput by adding more lanes (Horizontal Scaling), but that doesn't necessarily reduce the time it takes for one car to reach the destination.
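The analogy can be put into numbers with a small sketch. It assumes a simplified model in which each worker (a "lane") handles one request at a time and every request takes the same 200 ms; the function name and the figures are illustrative, not measurements.

```python
def throughput_rps(workers: int, latency_s: float) -> float:
    """Requests per second when each worker serves one request at a time."""
    return workers / latency_s

latency_s = 0.2  # assumed: every request takes 200 ms end to end

for lanes in (1, 4, 16):
    rps = throughput_rps(lanes, latency_s)
    # Latency stays constant; only throughput grows with the number of lanes.
    print(f"{lanes:>2} workers: latency {latency_s * 1000:.0f} ms, throughput {rps:.0f} req/s")
```

Adding workers multiplies throughput in this model, while the time a single request takes stays at 200 ms.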

Latency Numbers Every Programmer Should Know

To build intuition, you must understand the relative cost of operations. Modern computers are fast, but the gap between "local" and "network" access spans several orders of magnitude.

Operation	Latency (Approx)	Analogy (1ns = 1s)
L1 Cache reference	0.5 ns	0.5 seconds
Main memory reference (RAM)	100 ns	1.7 minutes
SSD Random Read	150 µs	1.7 days
Datacenter Round Trip	500 µs	5.8 days
Cross-continent Round Trip	150 ms	4.8 years
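The analogy column can be reproduced with a few lines of arithmetic. This is a minimal sketch: the `humanize` helper and the `LATENCIES_NS` dictionary are names invented for illustration, a year is assumed to be 365 days, and results are rounded to one decimal place.

```python
def humanize(seconds: float) -> str:
    """Express a duration in the largest unit that keeps the value >= 1."""
    units = [("years", 365 * 86400), ("days", 86400), ("minutes", 60)]
    for name, size in units:
        if seconds >= size:
            return f"{seconds / size:.1f} {name}"
    return f"{seconds:.1f} seconds"

# Latencies from the table above, converted to nanoseconds.
LATENCIES_NS = {
    "L1 cache reference": 0.5,
    "Main memory reference": 100,
    "SSD random read": 150_000,                 # 150 µs
    "Datacenter round trip": 500_000,           # 500 µs
    "Cross-continent round trip": 150_000_000,  # 150 ms
}

for operation, ns in LATENCIES_NS.items():
    # Scale 1 ns -> 1 s, so the nanosecond count is read directly as seconds.
    print(f"{operation}: {humanize(ns)}")
```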

✅

Interactive Tool: To explore these concepts interactively, check out our Latency Simulator.

2. The Scaling Spectrum: Vertical vs. Horizontal

When your system hits its limits, you have two primary ways to grow.

Vertical Scaling (Scaling Up)

Adding more power (CPU, RAM) to an existing machine.

Pros: Simple, no code changes, low operational complexity.
Cons: Hard upper limit on hardware, single point of failure (SPOF), cost rises steeply at the high end.
Real World: Instagram famously ran on a single massive PostgreSQL instance for years before sharding.

Horizontal Scaling (Scaling Out)

Adding more machines to the pool.

Pros: No hard ceiling on capacity, resilient (no single machine is a SPOF), uses commodity hardware.
Cons: Requires a Load Balancer, introduces distributed complexity (consistency, network partitions).
Real World: Airbnb improved the scalability of its web-serving tier by removing bottlenecks and distributing traffic across

3. Reliability & Availability: "The Nines"

A system can be fast and scalable, but it's useless if it's down.

Reliability: The probability that a system will perform its intended function without failure for a specified period.
Availability: The percentage of time a system is operational and accessible.

Availability is measured in "Nines":

Availability %	Downtime per Year	Class
99% ("Two Nines")	3.65 days	Basic
99.9% ("Three Nines")	8.76 hours	Standard SaaS
99.99% ("Four Nines")	52.6 minutes	High Availability
99.999% ("Five Nines")	5.26 minutes	Mission Critical
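The downtime column follows directly from the availability percentage. The sketch below assumes a 365-day year (a 365.25-day year shifts the results slightly); the function name is illustrative.

```python
HOURS_PER_YEAR = 365 * 24  # assumes a 365-day year

def downtime_hours_per_year(availability_pct: float) -> float:
    """Expected downtime per year, in hours, for a given availability percentage."""
    return (1 - availability_pct / 100) * HOURS_PER_YEAR

for pct in (99, 99.9, 99.99, 99.999):
    hours = downtime_hours_per_year(pct)
    print(f"{pct}% -> {hours:.2f} hours ({hours * 60:.1f} minutes)")
```

Each extra nine cuts the allowed downtime by a factor of ten.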

⚠️

Note: High availability (HA) often requires redundancy. If one component has 99.9% availability, and you need the whole system to be 99.99%, you must design for failover and remove all Single Points of Failure. Interactive Tool: To explore these concepts, check out our Availability Simulator.
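A rough way to see why redundancy helps is to combine component availabilities. The sketch below assumes failures are independent and failover is instantaneous, which real systems only approximate; `serial` and `redundant` are illustrative names.

```python
from math import prod

def serial(*availabilities: float) -> float:
    """All components must be up: multiply their availabilities."""
    return prod(availabilities)

def redundant(availability: float, replicas: int) -> float:
    """At least one of several identical replicas must be up."""
    return 1 - (1 - availability) ** replicas

# Two 99.9% components chained together are worse than either one alone.
print(f"serial chain:   {serial(0.999, 0.999):.6f}")
# Two redundant 99.9% replicas exceed the 99.99% target under these assumptions.
print(f"redundant pair: {redundant(0.999, 2):.6f}")
```

Correlated failures (a shared power supply, a bad deploy) break the independence assumption, so measured availability is usually lower than this estimate.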

4. Traffic Distribution: Load Balancing

Once you scale horizontally, you need a Load Balancer (LB) to act as a "traffic cop."

Layer 4 vs. Layer 7

L4 (Transport): Routes based on IP and Port. Extremely fast, but "blind" to the application data.
L7 (Application): Routes based on URLs, Cookies, and Headers. "Smart" but slower, because it has to parse each request before routing it.
- Example: Send all /images/* requests to a specialized image server, and /api/* to the backend.
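The routing example above can be sketched as a prefix match on the request path. The pool names (`image-pool`, `backend-pool`, `web-pool`) are hypothetical, and a real L7 load balancer would express this in its own configuration rather than in application code.

```python
# Ordered (path prefix, pool) rules; the first match wins. Pool names are hypothetical.
ROUTES = [
    ("/images/", "image-pool"),
    ("/api/", "backend-pool"),
]
DEFAULT_POOL = "web-pool"

def pick_pool(path: str) -> str:
    """Choose a backend pool by inspecting the request path (an L7 decision)."""
    for prefix, pool in ROUTES:
        if path.startswith(prefix):
            return pool
    return DEFAULT_POOL

for path in ("/images/cat.png", "/api/users/42", "/index.html"):
    print(path, "->", pick_pool(path))
```

An L4 balancer could not make this decision, since the path is only visible after the HTTP request has been parsed.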

Consistent Hashing

In a distributed system, how do you decide which server gets which piece of data? Simple modulo hashing (hash(key) % n) breaks down when you add or remove servers: changing n remaps most keys to a different server.
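The reshuffling can be measured with a quick experiment. This sketch assumes SHA-256 as a stable hash and synthetic keys of the form `user:<i>`; both are illustrative choices.

```python
import hashlib

def stable_hash(key: str) -> int:
    """Deterministic hash (Python's built-in hash() is randomized per process for strings)."""
    return int(hashlib.sha256(key.encode()).hexdigest(), 16)

def moved_fraction(keys, old_n: int, new_n: int) -> float:
    """Fraction of keys whose server index changes when the server count changes."""
    moved = sum(1 for k in keys if stable_hash(k) % old_n != stable_hash(k) % new_n)
    return moved / len(keys)

keys = [f"user:{i}" for i in range(10_000)]
# Growing from 4 to 5 servers: a key keeps its server only when both remainders agree.
print(f"moved: {moved_fraction(keys, 4, 5):.1%}")
```

For a cache tier, every moved key is a miss, so a single added server can invalidate most of the cache at once.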

Consistent Hashing maps keys and servers onto a logical "ring," with each key owned by the next server clockwise from its position. Adding or removing a node then moves only about 1/n of the keys on average.
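A minimal ring can be sketched with a sorted list and binary search. The `HashRing` class, the node names, and the use of MD5 are assumptions for illustration; the 100 virtual nodes per server are a common refinement added here to even out the distribution, not something the text above prescribes.

```python
import bisect
import hashlib

def _hash(value: str) -> int:
    """Map a string to a position on the ring (MD5 used only for spreading, not security)."""
    return int(hashlib.md5(value.encode()).hexdigest(), 16)

class HashRing:
    def __init__(self, nodes=(), vnodes: int = 100):
        self.vnodes = vnodes  # virtual nodes per server (an added assumption)
        self._positions = []  # sorted ring positions
        self._owner = {}      # ring position -> node name
        for node in nodes:
            self.add(node)

    def add(self, node: str) -> None:
        for i in range(self.vnodes):
            pos = _hash(f"{node}#{i}")
            self._owner[pos] = node
            bisect.insort(self._positions, pos)

    def remove(self, node: str) -> None:
        self._positions = [p for p in self._positions if self._owner[p] != node]
        self._owner = {p: n for p, n in self._owner.items() if n != node}

    def get(self, key: str) -> str:
        if not self._positions:
            raise LookupError("ring has no nodes")
        # First position clockwise from the key, wrapping around at the end.
        idx = bisect.bisect_right(self._positions, _hash(key)) % len(self._positions)
        return self._owner[self._positions[idx]]

ring = HashRing(["cache-a", "cache-b", "cache-c", "cache-d"])
keys = [f"user:{i}" for i in range(10_000)]
before = {k: ring.get(k) for k in keys}

ring.add("cache-e")
moved = [k for k in keys if ring.get(k) != before[k]]
print(f"moved {len(moved) / len(keys):.1%} of keys after adding a fifth node")
```

Every key that moves lands on the new node; the keys owned by the other four servers are untouched, which is what keeps the churn close to 1/n.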

Classic Paper: Amazon's Dynamo Paper.
Case Study: Discord uses this to scale their real-time communication.

5. The Golden Layer: Caching

Caching is often one of the most effective ways to improve read performance. By storing frequently accessed data in memory (RAM), you avoid expensive database or disk operations.

Where to Cache?

Client Side: Browser cache (HTTP headers).
CDN (Edge): Static assets (images, JS) cached closer to users via providers like Cloudflare or Fastly.
Application Layer: In-memory stores like Redis or Memcached.

The Cache-Aside Pattern

The most common strategy for general-purpose applications:

Read: Check Cache. If MISS, read from DB, write to Cache, and return.
Write: Update DB first, then invalidate (delete) the cache entry.

Why delete on write? Updating the cache directly can lead to race conditions where stale data is written over fresh data. Deleting ensures the next read will pull the latest source of truth from the database.
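The read and write paths above can be sketched with two dictionaries standing in for the cache and the database. The `CacheAside` class and the sample keys are illustrative; a real deployment would use a client for a store such as Redis or Memcached and would typically add TTLs.

```python
class CacheAside:
    def __init__(self, db: dict):
        self.db = db      # stand-in for the database (source of truth)
        self.cache = {}   # stand-in for an in-memory cache

    def read(self, key):
        if key in self.cache:          # cache HIT
            return self.cache[key]
        value = self.db.get(key)       # cache MISS: fall back to the database
        if value is not None:
            self.cache[key] = value    # populate the cache for later reads
        return value

    def write(self, key, value) -> None:
        self.db[key] = value           # update the database first
        self.cache.pop(key, None)      # then invalidate, rather than update, the cache

store = CacheAside({"user:1": "Ada"})
print(store.read("user:1"))        # miss, then cached
store.write("user:1", "Grace")     # cache entry is deleted
print(store.read("user:1"))        # miss again, reads the new value from the database
```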

Scaling Success Stories

Slack: Scaled their MySQL fleet using Vitess for horizontal sharding.
Google: Manages millions of containers using Borg, the predecessor to Kubernetes.

Summary: The Interview Cheat Sheet

Concept	Key Takeaway
Latency	Network round trips dwarf local access; memorize RAM (100 ns) vs SSD random read (150 µs) vs cross-continent round trip (150 ms).
Scaling	Start Vertical; go Horizontal when complexity is worth the gain.
Availability	Design for "Four Nines" (about 52.6 minutes of downtime per year) by removing SPOFs.
Load Balancing	Use L7 for smart routing; use Consistent Hashing for stateful scaling.
Caching	Cache-Aside (Delete on Write) is your safest default pattern.
