Putting It Together — Designing a Real System & Trade-offs
This is the capstone section. Until now you have learned five separate topics: databases (where data lives), caching (keeping copies of data close by for speed), networking (how machines talk), concurrency (doing many things at once safely), and distribution (spreading work across many machines). In real life you do not use these one at a time. You pull them all together, like levers on a control panel, while designing one system.
The good news: you do not need to memorize famous architectures. You need a repeatable thinking process. Learn the process and you can reason about any system — even one you have never seen before.
The repeatable design process
Follow these seven steps in order, every time:
- Clarify functional requirements — what must the system do? (Example: shorten a long web address into a short one, and redirect the short one back.)
- Clarify non-functional requirements — how well must it do it? This means scale (how many users), latency (how fast), availability (how often it is up), consistency (how fresh/correct the data must be), and durability (will we lose data).
- Back-of-envelope estimate — rough math for QPS, storage, and bandwidth. (QPS = "queries per second," the number of requests hitting the system each second.) This tells you whether one computer is enough or you need many.
- Define the API and data model — the contract. What does the client send, what comes back, and how is data stored.
- Sketch the high-level design — boxes and arrows: client → load balancer → app servers → cache → database, plus background workers.
- Find the bottleneck and scale it — the bottleneck is the slowest, most overloaded part. Usually it is the database or the hot reads.
- Harden for reliability — redundancy, graceful degradation, idempotency, monitoring.
Back-of-envelope estimation: the cheat sheet
"Back-of-envelope" means quick rough math you could do on the back of an envelope. Memorize this handful of facts and you can estimate almost anything.
- Seconds in a day ≈ 86,400 ≈ 10⁵ (one hundred thousand). So 1 million per day ≈ ~12 per second on average. Flip side: 1 request/second ≈ ~2.5 million/month.
- Peak ≈ 2–10× the average. Traffic is bursty (everyone shops at lunchtime). Always state your peak multiplier.
- Bandwidth = average payload size in bytes × QPS. (Multiply by 8 if you want bits/second to quote Mbps.)
- Storage = records per day × bytes per record × how many days you keep it × replication factor (how many copies you store).
- Power-of-two sizes: 2¹⁰ ≈ 1 KB, 2²⁰ ≈ 1 MB, 2³⁰ ≈ 1 GB, 2⁴⁰ ≈ 1 TB. One ASCII character = 1 byte; a UUID = 16 bytes.
- Rough single-node ceilings (sanity bounds, not gospel): a relational database does ~10,000 writes/second and tens of thousands of reads/second per machine; Redis (an in-memory cache) does 100,000+ operations/second.
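The cheat-sheet formulas above can be written down as a few small helpers. This is only a sketch: the function names and the sample inputs are illustrative assumptions, and the results are rough estimates, not measurements.

```python
SECONDS_PER_DAY = 86_400


def average_qps(requests_per_day: float) -> float:
    """Average requests per second for a given daily volume."""
    return requests_per_day / SECONDS_PER_DAY


def peak_qps(requests_per_day: float, peak_multiplier: float) -> float:
    """Average QPS scaled by the peak multiplier you chose to assume."""
    return average_qps(requests_per_day) * peak_multiplier


def bandwidth_mbps(payload_bytes: float, qps: float) -> float:
    """Bytes/second times 8 gives bits/second; divide by 10^6 for Mbps."""
    return payload_bytes * qps * 8 / 1_000_000


def storage_bytes(records_per_day: float, bytes_per_record: float,
                  retention_days: float, replication_factor: float) -> float:
    """Records/day x bytes/record x retention days x number of copies."""
    return records_per_day * bytes_per_record * retention_days * replication_factor


if __name__ == "__main__":
    print(f"1M requests/day is about {average_qps(1_000_000):.1f} QPS on average")
    print(f"2 KB payloads at 1,000 QPS need {bandwidth_mbps(2_000, 1_000):.0f} Mbps")
```

Keeping the peak multiplier and replication factor as explicit arguments forces you to state those assumptions rather than hide them in the arithmetic.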
Worked example A: a URL shortener (read-heavy)
A URL shortener takes a long web address and gives back a short code (like the ones you see in text messages). Click the short code and it redirects you to the original.
Requirements: create a short code from a long URL; redirect a short code to its long URL. Assume ~100 million new URLs per day, and links must work for years.
Estimation:
- Writes: 100,000,000 ÷ 86,400 ≈ 1,160 writes/second on average. With a peak multiplier of ~10×, plan for roughly 12,000/second at peak.
- Reads: a URL shortener has a read:write ratio around 100:1 (each link is created once but clicked many times). So ~116,000 reads/second average — much higher at peak.
- Storage: ~500 bytes per record (long URL + metadata) × 100M/day × 365 days × 5 years ≈ 90 TB over five years, before multiplying by the replication factor.
- Short code length: 7 characters using Base62 (the 62 URL-safe characters: a–z, A–Z, 0–9) gives 62⁷ ≈ 3.5 trillion possible codes, which is decades of headroom. We use Base62, not Base64, because Base64 includes + and /, which have special meaning inside URLs.
Design insight: with 100 reads for every write, the cache is the system. Generate a unique ID (from a global counter, or a Snowflake-style 64-bit ID), Base62-encode it as the short code, and store code → long_url. The redirect is just a primary-key lookup, so cache it hard: a CDN (a network of edge servers near users) plus Redis in front of the database.
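A minimal sketch of the Base62 step, assuming the unique ID is already available as a non-negative integer. The alphabet order (digits, then lowercase, then uppercase) and the function names are illustrative choices; any fixed ordering of the 62 characters works as long as encode and decode agree.

```python
import string

# 10 digits + 26 lowercase + 26 uppercase = 62 URL-safe characters.
ALPHABET = string.digits + string.ascii_lowercase + string.ascii_uppercase
BASE = len(ALPHABET)
INDEX = {ch: i for i, ch in enumerate(ALPHABET)}


def base62_encode(number: int) -> str:
    """Turn a non-negative integer ID into a Base62 short code."""
    if number < 0:
        raise ValueError("ID must be non-negative")
    if number == 0:
        return ALPHABET[0]
    digits = []
    while number:
        number, remainder = divmod(number, BASE)
        digits.append(ALPHABET[remainder])
    return "".join(reversed(digits))  # most significant digit first


def base62_decode(code: str) -> int:
    """Turn a short code back into the integer ID it was made from."""
    number = 0
    for ch in code:
        number = number * BASE + INDEX[ch]
    return number


if __name__ == "__main__":
    short = base62_encode(123_456_789)
    print(short, base62_decode(short))
    # Largest ID that still fits in 7 characters:
    print(len(base62_encode(62**7 - 1)))
```

Because the code is derived from a unique ID, two long URLs can never collide on the same short code, and the redirect path only needs the code → long_url lookup.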
Worked example B: e-commerce checkout (write-heavy, money on the line)
Browsing a catalog is read-heavy (cache it). Checkout is the hard part because real money, inventory, and outside services are involved. Every topic from earlier shows up here with teeth.
- Concurrency, the "last item in stock" race: two buyers click "Buy" for the final unit at the same moment. Without protection, both succeed and you oversell. The fix is a database transaction with row locking (SELECT ... FOR UPDATE, which "reserves" the row so the second buyer waits) or an atomic decrement (a single uninterruptible "subtract one"). This is the classic lost update problem made concrete.
- Consistency vs. availability, per piece: inventory and payment need strong consistency, because you must never oversell or double-charge. But the product's review count can be eventually consistent; it is fine if it is a few seconds stale. Different parts of one system sit at different points on the spectrum. That is the key insight.
- External calls + retries: the payment gateway might time out after it already charged the card. The client retries and — without protection — charges twice. This is exactly why idempotency keys exist (below).
- Async work: the confirmation email, invoice PDF, and warehouse notification go on a queue (a waiting line of jobs processed by background workers), not on the request path. This keeps checkout fast and survives a downstream outage.
        ┌── CDN (static files, images)
Client ─┤
        └─ LB ─ App servers ─ Cache (Redis) ─ DB (primary + replicas)
                     │
                     └─ Queue ─ Workers (email, PDF, inventory sync)
LB = load balancer (spreads requests across app servers)
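The "last item in stock" race can be illustrated with the atomic-decrement variant of the fix. The sketch below uses Python's built-in sqlite3 module purely as a stand-in database (SQLite has no SELECT ... FOR UPDATE); the inventory table, its columns, and the reserve_one helper are hypothetical names chosen for this example.

```python
import sqlite3


def reserve_one(conn: sqlite3.Connection, sku: str) -> bool:
    """Atomically take one unit of stock. Returns False when sold out."""
    with conn:  # commits on success, rolls back if an exception escapes
        cursor = conn.execute(
            # The stock > 0 guard and the decrement happen in one statement,
            # so two buyers cannot both take the final unit.
            "UPDATE inventory SET stock = stock - 1 "
            "WHERE sku = ? AND stock > 0",
            (sku,),
        )
        return cursor.rowcount == 1  # one row changed means we got a unit


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE inventory (sku TEXT PRIMARY KEY, stock INTEGER NOT NULL)"
)
conn.execute("INSERT INTO inventory VALUES ('widget', 1)")
conn.commit()

first_buyer = reserve_one(conn, "widget")
second_buyer = reserve_one(conn, "widget")
remaining = conn.execute(
    "SELECT stock FROM inventory WHERE sku = 'widget'"
).fetchone()[0]
```

The second buyer is refused instead of driving stock negative. A read-then-write done in application code (read stock, check it, write stock minus one) would reopen the lost update window unless it ran under a row lock.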
Latency numbers every programmer should know
Latency means "how long one thing takes." These rounded numbers (the classic Jeff Dean / Peter Norvig table) drive almost every design decision:
| Operation | Rough time |
|---|---|
| L1 CPU cache reference | ~0.5 ns |
| Main memory (RAM) reference | ~100 ns |
| Read 4 KB randomly from SSD | ~150 µs (150,000 ns) |
| Round trip inside one datacenter | ~0.5 ms (500 µs) |
| Hard-disk (HDD) seek | ~10 ms |
| Round trip California ↔ Netherlands | ~150 ms |
The mental model: a main-memory reference (~100 ns) is roughly 1,000× faster than a random SSD read (~150 µs), which is in turn roughly 1,000× faster than a cross-continent network round trip (~150 ms). This is why we cache in memory and why we avoid chatty cross-region calls.
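To see why chatty calls hurt, it helps to multiply a table entry by the number of sequential calls a request makes. The sketch below copies the rounded figures from the table; the dictionary keys and the helper name are illustrative, and the model ignores processing time and parallelism.

```python
# Rounded latencies from the table above, in nanoseconds.
LATENCY_NS = {
    "l1_cache_reference": 0.5,
    "main_memory_reference": 100,
    "ssd_random_read_4kb": 150_000,
    "datacenter_round_trip": 500_000,
    "hdd_seek": 10_000_000,
    "cross_continent_round_trip": 150_000_000,
}


def sequential_cost_ms(operation: str, count: int) -> float:
    """Total time in milliseconds for `count` back-to-back operations."""
    return LATENCY_NS[operation] * count / 1_000_000


if __name__ == "__main__":
    for op in ("datacenter_round_trip", "cross_continent_round_trip"):
        print(f"10 sequential {op}: {sequential_cost_ms(op, 10):.0f} ms")
```

Ten sequential round trips inside one datacenter cost about 5 ms, while the same ten calls across continents cost about 1.5 seconds, which is why cross-region requests are usually batched or served from a local copy.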
The five universal trade-offs
This is the heart of the section. Almost every design decision is one of these tugs-of-war.
| Trade-off | What pulls each way |
|---|---|
| Latency vs. throughput | Latency = time for one request. Throughput = requests handled per second. Batching many items together raises throughput but makes each item wait longer. Optimizing one can hurt the other. |
| Consistency vs. availability (CAP) | When the network "partitions" (machines can't reach each other), you must choose: refuse to serve possibly-wrong data (CP), or stay up and reconcile later (AP). Modern systems are tunable per operation. PACELC adds: even with no partition, you trade latency vs. consistency. |
| Cost vs. performance | More replicas, RAM, and regions buy speed and uptime — but cost money. Going from 99.9% to 99.99% uptime can cost roughly 10× the infrastructure. |
| Simplicity vs. flexibility | A monolith with one Postgres is easy to reason about. Microservices with many data stores scale teams but add operational and consistency complexity. Do not distribute prematurely. |
| Read-heavy vs. write-heavy | Read-heavy → replicas, caches, denormalize, CDN. Write-heavy → sharding, write-optimized stores (LSM-tree databases), queue-and-batch. |
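The latency vs. throughput row can be made concrete with a toy cost model. Everything numeric here is an assumption made up for illustration: each call pays a fixed overhead (5 ms) plus a small per-item cost (0.1 ms), and an item is not done until its whole batch is done.

```python
def batch_stats(batch_size: int,
                fixed_overhead_ms: float = 5.0,
                per_item_ms: float = 0.1) -> tuple[float, float]:
    """Return (items_per_second, batch_latency_ms) under the toy model."""
    batch_latency_ms = fixed_overhead_ms + per_item_ms * batch_size
    items_per_second = batch_size / batch_latency_ms * 1000
    return items_per_second, batch_latency_ms


if __name__ == "__main__":
    for size in (1, 10, 100, 1000):
        throughput, latency = batch_stats(size)
        print(f"batch={size:>4}  {throughput:>8.0f} items/s  {latency:>6.1f} ms per batch")
```

Larger batches spread the fixed overhead over more items, so throughput climbs, but every item now waits for the whole batch, so per-item latency climbs too. A real system would also add the time spent waiting for a batch to fill.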
Reliability: making it survive failure
Redundancy — no single point of failure
A "single point of failure" is one part whose death takes down everything. Avoid it: run multiple app servers behind a load balancer, keep database replicas with failover, and spread across availability zones or regions. But redundancy only helps if failover is automatic and tested. An untested standby gives false confidence — and remember the load balancer, the cache, and the database can each themselves be a single point of failure.
Graceful degradation
Under heavy stress, drop non-essential features instead of falling over completely. Serve a slightly stale cached page, hide recommendations, disable the review widget — but keep checkout working. A worse-but-working experience beats a blank error page.
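A minimal sketch of this idea, assuming a hypothetical page renderer that receives the recommendation lookup as a function. The names render_product_page and fetch_recommendations, and the shape of the returned dictionary, are invented for the example.

```python
def render_product_page(product: dict, fetch_recommendations) -> dict:
    """Build the page, dropping recommendations if that dependency fails."""
    try:
        recommendations = fetch_recommendations(product["id"])
    except Exception:
        # Non-essential feature: degrade to an empty widget instead of
        # failing the whole page. Real code would also log and count this.
        recommendations = []
    return {
        "product": product,
        "recommendations": recommendations,
        "checkout_enabled": True,  # the essential path stays available
    }


def broken_recommender(product_id):
    raise TimeoutError("recommendation service did not answer")


page = render_product_page({"id": 42, "name": "Widget"}, broken_recommender)
```

The design choice is deciding in advance which features are optional. Only those get the catch-and-continue treatment; a failure in payment or inventory should still surface as an error.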
Idempotency keys and safe retries
An operation is idempotent if doing it twice has the same effect as doing it once. To make "charge the card" idempotent, the client generates an idempotency key (a unique UUIDv4) per logical request and sends it along. The server stores key → result. If a retry arrives with the same key, the server returns the stored result instead of charging again. (Stripe keeps these keys for ~24 hours.)
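The key → result idea can be sketched with an in-memory dictionary. This is an illustration only: PaymentService and its fields are hypothetical, and a production version would need a durable store, an atomic check-and-insert (for example a unique constraint on the key), and key expiry.

```python
import uuid


class PaymentService:
    """Toy server that remembers the result for each idempotency key."""

    def __init__(self) -> None:
        self._results: dict[str, dict] = {}  # idempotency key -> stored result
        self.charges_made = 0

    def charge(self, idempotency_key: str, amount_cents: int) -> dict:
        # A retry with a known key gets the stored result, not a new charge.
        if idempotency_key in self._results:
            return self._results[idempotency_key]
        self.charges_made += 1  # stands in for actually charging the card
        result = {
            "charge_id": f"ch_{self.charges_made}",
            "amount_cents": amount_cents,
        }
        self._results[idempotency_key] = result
        return result


service = PaymentService()
key = str(uuid.uuid4())            # generated once per logical request
first = service.charge(key, 1999)
retry = service.charge(key, 1999)  # e.g. the client timed out and retried
```

Here the retry returns the same charge as the first call and the card is charged once. The key must be generated by the client before the first attempt and reused on every retry; generating a fresh key per attempt defeats the purpose.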
Retries should use exponential backoff with jitter. The formula:
delay = min(cap, base × 2^attempt) + random_jitter
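The formula translates directly into code. In this sketch the default values (base of 0.1 s, cap of 10 s, up to 0.1 s of jitter), the choice to retry only on ConnectionError, and the helper names are all assumptions for illustration.

```python
import random
import time


def backoff_delay(attempt: int, base: float = 0.1, cap: float = 10.0,
                  max_jitter: float = 0.1) -> float:
    """delay = min(cap, base * 2^attempt) + random jitter, in seconds."""
    return min(cap, base * 2 ** attempt) + random.uniform(0, max_jitter)


def call_with_retries(operation, max_attempts: int = 5, sleep=time.sleep):
    """Call `operation`, backing off between failures; re-raise on the last."""
    for attempt in range(max_attempts):
        try:
            return operation()
        except ConnectionError:
            if attempt == max_attempts - 1:
                raise  # out of attempts: let the caller see the failure
            sleep(backoff_delay(attempt))
```

The jitter term spreads out clients that failed at the same moment so they do not all retry in lockstep. Retrying is only safe when the operation is idempotent, which is why this pairs with idempotency keys.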
Observability: the three pillars
Observability means being able to understand what your system is doing from the outside. It rests on three pillars:
- Metrics = numeric time-series (QPS, error rate, p99 latency, CPU). Cheap; perfect for dashboards and alerts. They tell you what is wrong.
- Logs = timestamped records of events, with rich detail. They tell you why it is wrong.
- Traces = the journey of one request across all the services it touches, with timing at each hop. They tell you where it is slow or failing.
They work together: metrics alert, traces localize, logs explain. OpenTelemetry (OTel) is the vendor-neutral standard for emitting all three.
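The p99 latency metric mentioned above is worth a small illustration, because an average can look healthy while some users have a terrible experience. The sample data is made up (98 fast requests and 2 very slow ones), and the percentile helper uses the simple nearest-rank method, which is one of several common definitions.

```python
import math
import statistics


def percentile(values: list[float], p: float) -> float:
    """Nearest-rank percentile: smallest value covering p% of the samples."""
    ordered = sorted(values)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


# Hypothetical latencies in milliseconds: 98 fast requests, 2 slow ones.
latencies_ms = [10.0] * 98 + [1000.0] * 2

mean_ms = statistics.mean(latencies_ms)
p50_ms = percentile(latencies_ms, 50)
p99_ms = percentile(latencies_ms, 99)

if __name__ == "__main__":
    print(f"mean={mean_ms:.1f} ms  p50={p50_ms:.0f} ms  p99={p99_ms:.0f} ms")
```

With this data the mean is about 30 ms, which hides the fact that the slowest requests took a full second; the p99 exposes it.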
SLI, SLO, SLA, and error budgets
- SLI (Service Level Indicator) = the measured number, e.g. "% of requests served under 200 ms."
- SLO (Service Level Objective) = your internal target for that SLI, e.g. 99.9%.
- SLA (Service Level Agreement) = a contractual promise to customers, with penalties (refunds) if you miss it. It is usually set looser than the SLO, so you have a safety margin.
- Error budget = 100% − SLO = how much failure you are allowed. A 99.9% monthly SLO gives ~43 minutes of budget. Spend it on shipping faster and taking risks; when it is used up, freeze risky changes.
| Availability | Downtime per year | Per month |
|---|---|---|
| 99% | ~3.65 days | ~7.2 h |
| 99.9% ("three nines") | ~8.76 h | ~43 min |
| 99.99% ("four nines") | ~52 min | ~4.3 min |
| 99.999% ("five nines") | ~5 min | ~26 s |
As a rule of thumb, each extra nine costs roughly 10× more to achieve. Pick the level your business actually needs; most products do not need five nines.
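The downtime figures in the table follow directly from error budget = 100% − SLO. A small sketch, assuming a 30-day month and a 365-day year; the function name is illustrative.

```python
def allowed_downtime_minutes(slo_percent: float, period_days: float) -> float:
    """Minutes of downtime the error budget allows over the period."""
    error_budget = 1 - slo_percent / 100  # e.g. 99.9% -> 0.001
    return error_budget * period_days * 24 * 60


if __name__ == "__main__":
    for slo in (99.0, 99.9, 99.99, 99.999):
        per_month = allowed_downtime_minutes(slo, 30)
        per_year = allowed_downtime_minutes(slo, 365)
        print(f"{slo}%: {per_month:8.2f} min/month  {per_year:9.2f} min/year")
```

Running the same calculation against your own SLO and measurement window shows how much budget a single incident consumes.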
How to reason about any system you meet
This is the takeaway skill. When you face a system you have never seen, ask these questions in order:
- What does it do?
- What is the scale (QPS, storage)?
- What is the read:write shape?
- Where is the bottleneck?
- What does it cache, and what is the invalidation (staleness) story?
- What is the consistency requirement for each piece?
- What happens when X fails?
- How would I know it failed (metrics/alerts)?
If you can answer those eight questions, you can analyze any system. There is rarely one "right" design — only trade-offs made explicit and matched to requirements.
Key takeaways
- Use the repeatable process every time: requirements → estimate → API/data model → sketch → find the bottleneck → harden for reliability. State assumptions out loud.
- The five universal trade-offs (latency/throughput, consistency/availability, cost/performance, simplicity/flexibility, read-heavy/write-heavy) underlie nearly every decision — and consistency is tunable per operation, not for the whole system.
- A main-memory reference is roughly 1,000× faster than a random SSD read, which is roughly 1,000× faster than a cross-continent round trip. This is why caching and locality dominate design.
- Reliability = redundancy with tested automatic failover, graceful degradation, client-generated idempotency keys, and retries with exponential backoff plus jitter (jitter prevents the thundering herd).
- Observe with the three pillars (metrics alert, traces localize, logs explain) and track latency with percentiles like p99 rather than averages, which hide the slow tail. SLI is measured, SLO is your target, SLA is the contract; error budget = 100% − SLO.
- To analyze any system, ask: scale, read:write shape, bottleneck, caching+invalidation, per-component consistency, failure modes, observability.