Basic System Concepts: Latency, Throughput, Scaling, Load Balancing, and Caching
Understand the foundational metrics and patterns of distributed systems: latency numbers, vertical vs horizontal scaling, load balancing algorithms, and caching strategies.
The Blueprint of System Design
In system design, every decision is a trade-off. To build systems that scale to millions of users, you must understand the fundamental forces that govern distributed architecture. These aren't just academic concepts—they are the "blueprint" used by companies like Google, Meta, and Amazon to build global-scale infrastructure.
1. The Performance Dial: Latency vs. Throughput
Engineers often use "speed" as a catch-all term, but in distributed systems, speed has two distinct dimensions:
- Latency: The time it takes for a single request to complete (measured in ms, µs, ns). It's the "delay."
- Throughput: The number of requests a system can handle per unit of time (measured in req/s, QPS, or MB/s). It's the "capacity."
The Highway Analogy: Imagine a highway.
- Latency is how long it takes a single car to travel from Point A to Point B.
- Throughput is how many cars pass a specific point on the highway in one hour.
- You can increase throughput by adding more lanes (Horizontal Scaling), but that doesn't necessarily reduce the time it takes for one car to reach the destination.
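The lane analogy can be put into numbers with a deliberately simple model. The sketch below assumes each worker (a "lane") handles one request at a time and that per-request latency stays constant under load; the function name `max_throughput_rps` and the 50 ms figure are illustrative, not taken from any real system.

```python
def max_throughput_rps(workers: int, latency_ms: float) -> float:
    """Upper bound on requests/second when each worker serves one request at a time."""
    return workers * 1000 / latency_ms


latency_ms = 50  # time for a single request; adding workers does not change it

for workers in (1, 4, 16):
    rps = max_throughput_rps(workers, latency_ms)
    print(f"{workers:>2} workers: {latency_ms} ms per request, up to {rps:.0f} req/s")
```

Throughput grows with the number of workers while the latency column stays the same, which is the point of the analogy: more lanes move more cars, but each car still takes just as long.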
Latency Numbers Every Programmer Should Know
To build intuition, you must understand the relative cost of operations. Modern computers are fast, but the gap between "local" and "network" is monumental.
| Operation | Latency (Approx) | Analogy (1ns = 1s) |
|---|---|---|
| L1 Cache reference | 0.5 ns | 0.5 seconds |
| Main memory reference (RAM) | 100 ns | 1.7 minutes |
| SSD Random Read | 150 µs | 1.7 days |
| Datacenter Round Trip | 500 µs | 5.8 days |
| Cross-continent Round Trip | 150 ms | 4.8 years |
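The analogy column is plain unit conversion: treat every nanosecond as one second and re-express the result in human units. A minimal sketch, assuming a 365-day year and using a hypothetical helper named `humanized`:

```python
NS_PER_UNIT = {"ns": 1, "us": 1_000, "ms": 1_000_000}  # "us" stands for microseconds

SECONDS_PER = (("years", 365 * 86_400), ("days", 86_400), ("minutes", 60))


def humanized(value: float, unit: str) -> str:
    """Scale a latency so that 1 ns becomes 1 s, then pick a readable unit."""
    seconds = value * NS_PER_UNIT[unit]
    for name, size in SECONDS_PER:
        if seconds >= size:
            return f"{seconds / size:.1f} {name}"
    return f"{seconds:g} seconds"


for label, value, unit in [
    ("L1 cache reference", 0.5, "ns"),
    ("Main memory reference", 100, "ns"),
    ("SSD random read", 150, "us"),
    ("Datacenter round trip", 500, "us"),
    ("Cross-continent round trip", 150, "ms"),
]:
    print(f"{label:<28} {humanized(value, unit)}")
```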
Interactive Tool: To explore these concepts interactively, check out our Latency Simulator.
2. The Scaling Spectrum: Vertical vs. Horizontal
When your system hits its limits, you have two primary ways to grow.
Vertical Scaling (Scaling Up)
Adding more power (CPU, RAM) to an existing machine.
- Pros: Simple, no code changes, low complexity.
- Cons: Upper hardware limit, single point of failure (SPOF), expensive.
- Real World: Instagram famously ran on a single massive PostgreSQL instance for years before sharding.
Horizontal Scaling (Scaling Out)
Adding more machines to the pool.
- Pros: No hard hardware ceiling, resilient (no single machine is a SPOF), uses commodity hardware.
- Cons: Requires a Load Balancer, introduces distributed complexity (consistency, network partitions).
- Real World: Airbnb improved the scalability of its web-serving tier by removing bottlenecks and distributing traffic across
3. Reliability & Availability: "The Nines"
A system can be fast and scalable, but it's useless if it's down.
- Reliability: The probability that a system will perform its intended function without failure for a specified period.
- Availability: The percentage of time a system is operational and accessible.
Availability is measured in "Nines":
| Availability % | Downtime per Year | Class |
|---|---|---|
| 99% ("Two Nines") | 3.65 days | Basic |
| 99.9% ("Three Nines") | 8.77 hours | Standard SaaS |
| 99.99% ("Four Nines") | 52.6 minutes | High Availability |
| 99.999% ("Five Nines") | 5.26 minutes | Mission Critical |
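The downtime column follows directly from the availability percentage. A short sketch of the arithmetic, assuming a 365.25-day year (the function name is illustrative):

```python
MINUTES_PER_YEAR = 365.25 * 24 * 60


def downtime_minutes_per_year(availability_pct: float) -> float:
    """Minutes per year a system may be down at the given availability."""
    return (1 - availability_pct / 100) * MINUTES_PER_YEAR


for pct in (99, 99.9, 99.99, 99.999):
    minutes = downtime_minutes_per_year(pct)
    print(f"{pct}% -> {minutes:.2f} min/year ({minutes / 60:.2f} hours)")
```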
Note: High availability (HA) usually requires redundancy. A request that depends on several components in sequence can be no more available than the weakest of them, so if one component has 99.9% availability and you need the whole system to be 99.99%, you must design for failover and remove every Single Point of Failure. To explore these concepts interactively, check out our Availability Simulator.
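The effect of redundancy can be estimated with two standard formulas, sketched here under the strong assumption that components fail independently (real outages are often correlated, so treat the results as optimistic). The helper names `serial` and `redundant` are illustrative.

```python
import math


def serial(*availabilities: float) -> float:
    """All components must be up: the availabilities multiply."""
    return math.prod(availabilities)


def redundant(*availabilities: float) -> float:
    """At least one replica must be up: the system fails only if all of them fail."""
    return 1 - math.prod(1 - a for a in availabilities)


single = 0.999  # one component at "three nines"

print(f"two in series:   {serial(single, single):.6f}")
print(f"two as replicas: {redundant(single, single):.6f}")
```

Chaining two 99.9% components lowers overall availability, while running the same two as failover replicas lifts it past four nines in this idealized model.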
4. Traffic Distribution: Load Balancing
Once you scale horizontally, you need a Load Balancer (LB) to act as a "traffic cop."
Layer 4 vs. Layer 7
- L4 (Transport): Routes based on IP and Port. Extremely fast, but "blind" to the application data.
- L7 (Application): Routes based on URLs, Cookies, and Headers. "Smart" but slower.
- Example: Send all /images/* requests to a specialized image server, and /api/* to the backend.
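Path-based L7 routing like the example above can be sketched as an ordered list of patterns. The pool names and the `choose_pool` function are hypothetical; a real load balancer expresses the same idea in its own configuration language.

```python
from fnmatch import fnmatchcase

# Ordered (pattern, backend pool) rules; the first match wins.
ROUTES = [
    ("/images/*", "image-servers"),
    ("/api/*", "backend-servers"),
]
DEFAULT_POOL = "web-servers"


def choose_pool(path: str) -> str:
    """Pick a backend pool from the request path, as an L7 balancer might."""
    for pattern, pool in ROUTES:
        if fnmatchcase(path, pattern):
            return pool
    return DEFAULT_POOL


for path in ("/images/cat.png", "/api/v1/users", "/index.html"):
    print(f"{path} -> {choose_pool(path)}")
```

An L4 balancer could not make this decision, because the URL path is only visible once the application-layer request has been parsed.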
Consistent Hashing
In a distributed system, how do you decide which server gets which piece of data? Simple modulo (hash(key) % n) breaks down when you add or remove servers: changing n remaps almost every key.
Consistent Hashing maps keys and servers to a logical "ring," so that adding or removing a node only moves roughly 1/n of the keys on average.
- Classic Paper: Amazon's Dynamo Paper.
- Case Study: Discord uses this to scale their real-time communication.
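To make the difference concrete, here is a minimal sketch that grows a cluster from four to five servers and counts how many keys change owner under each scheme. It is an illustration, not the implementation used by Dynamo or Discord: the `HashRing` class, the MD5-based `stable_hash`, the server names, and the choice of 100 virtual nodes per server are all assumptions.

```python
import bisect
import hashlib


def stable_hash(value: str) -> int:
    """Deterministic 64-bit hash (Python's built-in hash() is salted per process)."""
    return int.from_bytes(hashlib.md5(value.encode()).digest()[:8], "big")


class HashRing:
    def __init__(self, nodes, vnodes=100):
        self._vnodes = vnodes  # virtual nodes per server smooth out the distribution
        self._ring = []        # sorted list of (position, node) pairs
        for node in nodes:
            self.add(node)

    def add(self, node: str) -> None:
        for i in range(self._vnodes):
            bisect.insort(self._ring, (stable_hash(f"{node}#{i}"), node))

    def lookup(self, key: str) -> str:
        # First ring position clockwise from the key's hash, wrapping at the end.
        idx = bisect.bisect_left(self._ring, (stable_hash(key), ""))
        return self._ring[idx % len(self._ring)][1]


keys = [f"user:{i}" for i in range(10_000)]
servers = ["server-a", "server-b", "server-c", "server-d"]

# Modulo placement: growing from 4 to 5 servers.
moved_modulo = sum(stable_hash(k) % 4 != stable_hash(k) % 5 for k in keys)

# Ring placement: record owners, add a fifth server, compare.
ring = HashRing(servers)
before = {k: ring.lookup(k) for k in keys}
ring.add("server-e")
after = {k: ring.lookup(k) for k in keys}
moved_ring = sum(before[k] != after[k] for k in keys)

print(f"modulo: {moved_modulo / len(keys):.0%} of keys moved")
print(f"ring:   {moved_ring / len(keys):.0%} of keys moved")
```

With the ring, every key that moves lands on the new server and the remaining assignments are untouched, which is what keeps rebalancing cheap.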
5. The Golden Layer: Caching
Caching is often the single most effective way to cut read latency and database load. By storing frequently accessed data in memory (RAM), you avoid expensive database or disk operations.
Where to Cache?
- Client Side: Browser cache (HTTP headers).
- CDN (Edge): Static assets (images, JS) cached closer to users via providers like Cloudflare or Fastly.
- Application Layer: In-memory stores like Redis or Memcached.
The Cache-Aside Pattern
The most common strategy for general-purpose applications:
- Read: Check Cache. If MISS, read from DB, write to Cache, and return.
- Write: Update DB first, then invalidate (delete) the cache entry.
Why delete on write? Updating the cache directly can lead to race conditions where stale data is written over fresh data. Deleting ensures the next read will pull the latest source of truth from the database.
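The read and write paths described above can be sketched with two dictionaries standing in for the database and the cache. The `CacheAsideStore` class is hypothetical; a real deployment would talk to a database and to Redis or Memcached, and would also set a TTL on cached entries.

```python
class CacheAsideStore:
    def __init__(self, db: dict):
        self.db = db     # stands in for the database (source of truth)
        self.cache = {}  # stands in for an in-memory cache

    def read(self, key):
        if key in self.cache:        # HIT: serve from memory
            return self.cache[key]
        value = self.db[key]         # MISS: fall back to the database
        self.cache[key] = value      # populate the cache for later reads
        return value

    def write(self, key, value):
        self.db[key] = value         # update the source of truth first
        self.cache.pop(key, None)    # then invalidate instead of updating


store = CacheAsideStore({"user:1": "Ada"})
print(store.read("user:1"))      # miss: loaded from db, now cached
store.write("user:1", "Grace")   # db updated, cache entry deleted
print(store.read("user:1"))      # miss again: fresh value from db
```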
Scaling Success Stories
- Slack: Scaled their MySQL fleet using Vitess for horizontal sharding.
- Google: Manages millions of containers using Borg, the predecessor to Kubernetes.
Summary: The Interview Cheat Sheet
| Concept | Key Takeaway |
|---|---|
| Latency | Network latency dwarfs local access; memorize RAM (100 ns) vs SSD random read (150 µs) vs cross-continent round trip (150 ms). |
| Scaling | Start Vertical; go Horizontal when complexity is worth the gain. |
| Availability | Design for "Four Nines" (52m downtime/yr) by removing SPOFs. |
| Load Balancing | Use L7 for smart routing; use Consistent Hashing for stateful scaling. |
| Caching | Cache-Aside (Delete on Write) is your safest default pattern. |