Putting It Together — Designing a Real System & Trade-offs

By Pritesh Yadav 13 min read

This is the capstone section. Until now you have learned five separate topics: databases (where data lives), caching (keeping copies of data close by for speed), networking (how machines talk), concurrency (doing many things at once safely), and distribution (spreading work across many machines). In real life you do not use these one at a time. You pull them all together, like levers on a control panel, while designing one system.

The good news: you do not need to memorize famous architectures. You need a repeatable thinking process. Learn the process and you can reason about any system — even one you have never seen before.

Key takeaway: System design is not about knowing the "right answer." It is about making trade-offs explicit and matching them to the requirements. State your assumptions out loud and defend your choices.

The repeatable design process

Follow these seven steps in order, every time:

  1. Clarify functional requirements — what must the system do? (Example: shorten a long web address into a short one, and redirect the short one back.)
  2. Clarify non-functional requirements — how well must it do it? This means scale (how many users), latency (how fast), availability (how often it is up), consistency (how fresh/correct the data must be), and durability (will we lose data).
  3. Back-of-envelope estimate — rough math for QPS, storage, and bandwidth. (QPS = "queries per second," the number of requests hitting the system each second.) This tells you whether one computer is enough or you need many.
  4. Define the API and data model — the contract. What does the client send, what comes back, and how is data stored.
  5. Sketch the high-level design — boxes and arrows: client → load balancer → app servers → cache → database, plus background workers.
  6. Find the bottleneck and scale it — the bottleneck is the slowest, most overloaded part. Usually it is the database or the hot reads.
  7. Harden for reliability — redundancy, graceful degradation, idempotency, monitoring.

Best practice: Always state your assumptions and round aggressively. The goal is the right order of magnitude (is it 10 servers or 1,000?), not a precise number.

Common mistake: Jumping straight to a complex distributed design before clarifying requirements and estimating scale. This is over-engineering: building for millions of users you do not have. A single PostgreSQL database handles far more than beginners expect.

Back-of-envelope estimation: the cheat sheet

"Back-of-envelope" means quick rough math you could do on the back of an envelope. Memorize this handful of facts and you can estimate almost anything.

  • Seconds in a day = 86,400 ≈ 10⁵ (one hundred thousand). So 1 million requests per day ≈ 12 per second on average. Flip side: 1 request/second ≈ 2.6 million per 30-day month.
  • Peak ≈ 2–10× the average. Traffic is bursty (everyone shops at lunchtime). Always state your peak multiplier.
  • Bandwidth = average payload size in bytes × QPS. (Multiply by 8 if you want bits/second to quote Mbps.)
  • Storage = records per day × bytes per record × how many days you keep it × replication factor (how many copies you store).
  • Power-of-two sizes: 2¹⁰ ≈ 1 KB, 2²⁰ ≈ 1 MB, 2³⁰ ≈ 1 GB, 2⁴⁰ ≈ 1 TB. One ASCII character = 1 byte; a UUID = 16 bytes.
  • Rough single-node ceilings (sanity bounds, not gospel): a relational database does ~10,000 writes/second and tens of thousands of reads/second per machine; Redis (an in-memory cache) does 100,000+ operations/second.
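The cheat-sheet formulas above can be turned into a few small helpers. This is a minimal sketch, not a standard library: the function names, the 5× peak multiplier, and the sample payload and record sizes are all illustrative assumptions.

```python
SECONDS_PER_DAY = 86_400  # the cheat sheet rounds this to 10**5


def average_qps(requests_per_day: float) -> float:
    """Average requests per second for a given daily volume."""
    return requests_per_day / SECONDS_PER_DAY


def peak_qps(requests_per_day: float, peak_multiplier: float) -> float:
    """Peak requests per second; the multiplier (2-10x) is an assumption you state."""
    return average_qps(requests_per_day) * peak_multiplier


def bandwidth_mbps(payload_bytes: float, qps: float) -> float:
    """Bandwidth in megabits per second: bytes x QPS, then x 8 for bits."""
    return payload_bytes * qps * 8 / 1_000_000


def storage_bytes(records_per_day: float, bytes_per_record: float,
                  retention_days: float, replication_factor: int) -> float:
    """Total stored bytes over the retention window, counting every replica."""
    return records_per_day * bytes_per_record * retention_days * replication_factor


if __name__ == "__main__":
    print("avg QPS for 1M/day:", round(average_qps(1_000_000), 1))
    print("peak QPS at an assumed 5x:", round(peak_qps(1_000_000, 5), 1))
    print("Mbps for 2 KB payloads at 1,000 QPS:", bandwidth_mbps(2_000, 1_000))
    total = storage_bytes(1_000_000, 1_000, 365, 3)
    print("TB for 1 KB records, 1 year, 3 copies:", total / 1e12)
```

Keeping each input as a named parameter makes the assumptions visible, which is the real point of the exercise.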

Worked example A: a URL shortener (read-heavy)

A URL shortener takes a long web address and gives back a short code (like the ones you see in text messages). Click the short code and it redirects you to the original.

Requirements: create a short code from a long URL; redirect a short code to its long URL. Assume ~100 million new URLs per day, and links must work for years.

Estimation:

  • Writes: 100,000,000 ÷ 86,400 ≈ 1,160 writes/second on average. A 10× peak multiplier gives ~11,600/second; call it roughly 10,000/second at peak, since only the order of magnitude matters here.
  • Reads: a URL shortener has a read:write ratio around 100:1 (each link is created once but clicked many times). So ~116,000 reads/second average — much higher at peak.
  • Storage: ~500 bytes per record (long URL + metadata) × 100M/day × 365 days × 5 years ≈ 91 TB over five years, before multiplying by the replication factor.
  • Short code length: 7 characters using Base62 (the 62 URL-safe characters: a–z, A–Z, 0–9) gives 62⁷ ≈ 3.5 trillion possible codes — decades of headroom. We use Base62, not Base64, because Base64 includes + and /, which have special meaning inside URLs.
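The estimates above are easy to re-derive in a few lines. The sketch below simply replays the stated assumptions (100M new URLs/day, a 100:1 read:write ratio, ~500 bytes per record, 7-character Base62 codes); the variable names are illustrative.

```python
SECONDS_PER_DAY = 86_400

new_urls_per_day = 100_000_000   # stated requirement
read_write_ratio = 100           # assumed 100:1
bytes_per_record = 500           # assumed average record size
retention_years = 5
code_length = 7
alphabet_size = 62               # Base62

avg_writes_per_sec = new_urls_per_day / SECONDS_PER_DAY
avg_reads_per_sec = avg_writes_per_sec * read_write_ratio
storage_tb = new_urls_per_day * bytes_per_record * 365 * retention_years / 1e12
code_space = alphabet_size ** code_length
years_of_codes = code_space / (new_urls_per_day * 365)

print(f"writes/s (avg): {avg_writes_per_sec:,.0f}")
print(f"reads/s (avg):  {avg_reads_per_sec:,.0f}")
print(f"storage:        {storage_tb:.1f} TB before replication")
print(f"code space:     {code_space:,} codes, about {years_of_codes:.0f} years at this rate")
```

Changing one assumption (say, a 10:1 read:write ratio) and rerunning shows quickly which inputs the design is most sensitive to.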

Design insight: with 100 reads for every write, the cache is the system. Generate a unique ID (from a global counter, or a Snowflake-style 64-bit ID), Base62-encode it as the short code, and store code → long_url. The redirect is just a primary-key lookup, so cache it hard: a CDN (a network of edge servers near users) plus Redis in front of the database.
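Turning a numeric ID into a short code is a plain base conversion. Here is a minimal sketch of Base62 encoding and decoding; the alphabet order (digits, then lowercase, then uppercase) is an assumption, and any fixed ordering of the 62 characters works as long as encoder and decoder agree.

```python
ALPHABET = "0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ"
BASE = len(ALPHABET)  # 62


def base62_encode(number: int) -> str:
    """Encode a non-negative integer ID as a Base62 string."""
    if number < 0:
        raise ValueError("number must be non-negative")
    if number == 0:
        return ALPHABET[0]
    digits = []
    while number > 0:
        number, remainder = divmod(number, BASE)
        digits.append(ALPHABET[remainder])
    return "".join(reversed(digits))  # most significant digit first


def base62_decode(code: str) -> int:
    """Decode a Base62 string back into the integer ID."""
    number = 0
    for char in code:
        number = number * BASE + ALPHABET.index(char)
    return number


short_code = base62_encode(123_456_789)
print(short_code, base62_decode(short_code))
```

Codes stay at 7 characters or fewer until the ID reaches 62⁷, which is where the capacity figure in the estimate comes from.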

Example: A 301 redirect ("moved permanently") lets the browser cache the destination, so repeat clicks never touch your servers — great for load. But then you lose click analytics. A 302 ("found / temporary") keeps every click hitting you, so you can count it — but you carry all the load. That tension (analytics vs. load) is a real trade-off you should surface, not hide.
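The 301-versus-302 choice can be made an explicit switch in the redirect handler. The sketch below is framework-agnostic and hypothetical: `build_redirect` and its `count_clicks` flag are illustrative names, and the `Cache-Control` values are example settings rather than recommendations.

```python
def build_redirect(long_url: str, count_clicks: bool) -> tuple[int, dict[str, str]]:
    """Return (status_code, headers) for a short-link redirect."""
    if count_clicks:
        # Temporary redirect: browsers ask again each time, so every click is visible.
        return 302, {"Location": long_url, "Cache-Control": "no-store"}
    # Permanent redirect: browsers may cache it, so repeat clicks can skip the server.
    return 301, {"Location": long_url, "Cache-Control": "public, max-age=86400"}


status, headers = build_redirect("https://example.com/a/very/long/path", count_clicks=True)
print(status, headers["Location"])
```

Exposing the decision as a parameter keeps the analytics-versus-load trade-off visible in the code instead of buried in a default.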

Worked example B: e-commerce checkout (write-heavy, money on the line)

Browsing a catalog is read-heavy (cache it). Checkout is the hard part because real money, inventory, and outside services are involved. Every topic from earlier shows up here with teeth.

  • Concurrency — the "last item in stock" race: two buyers click "Buy" for the final unit at the same moment. Without protection, both succeed and you oversell. The fix is a database transaction with row locking (SELECT ... FOR UPDATE, which "reserves" the row so the second buyer waits) or an atomic decrement (a single uninterruptible "subtract one"). This is the classic lost update problem made concrete.
  • Consistency vs. availability, per piece: inventory and payment need strong consistency — you must never oversell or double-charge. But the product's review count can be eventually consistent — it is fine if it is a few seconds stale. Different parts of one system sit at different points on the spectrum. That is the key insight.
  • External calls + retries: the payment gateway might time out after it already charged the card. The client retries and — without protection — charges twice. This is exactly why idempotency keys exist (below).
  • Async work: the confirmation email, invoice PDF, and warehouse notification go on a queue (a waiting line of jobs processed by background workers), not on the request path. This keeps checkout fast and survives a downstream outage.
        ┌── CDN (static files, images)
Client ─┤
        └─ LB ─ App servers ─ Cache (Redis) ─ DB (primary + replicas)
                     │
                     └─ Queue ─ Workers (email, PDF, inventory sync)

  LB = load balancer (spreads requests across app servers)

Common mistake: Putting slow or external work (emails, PDFs, third-party calls) on the synchronous request path. This inflates the time the user waits and ties your uptime to someone else's service. Push it to a queue.
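The "last item in stock" race from the checkout list can be closed with an atomic conditional decrement. The sketch below uses Python's built-in `sqlite3` with a hypothetical `inventory` table, and it runs the two buyers one after the other for illustration. It shows the single-statement approach, not `SELECT ... FOR UPDATE`, which SQLite does not support.

```python
import sqlite3


def reserve_one(conn: sqlite3.Connection, product_id: int) -> bool:
    """Try to take one unit of stock; True if reserved, False if sold out."""
    with conn:  # commits on success, rolls back on exception
        cursor = conn.execute(
            "UPDATE inventory SET stock = stock - 1 "
            "WHERE product_id = ? AND stock > 0",
            (product_id,),
        )
    # The check and the decrement happen in one statement, so the row is
    # changed only if stock was still positive at that moment.
    return cursor.rowcount == 1


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE inventory (product_id INTEGER PRIMARY KEY, stock INTEGER NOT NULL)"
)
conn.execute("INSERT INTO inventory VALUES (1, 1)")  # exactly one unit left
conn.commit()

first_buyer = reserve_one(conn, 1)
second_buyer = reserve_one(conn, 1)
print(first_buyer, second_buyer)
```

Because the `stock > 0` condition lives inside the `UPDATE`, there is no window between "read the stock" and "write the new stock" for a second buyer to slip through.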

Latency numbers every programmer should know

Latency means "how long one thing takes." These rounded numbers (the classic Jeff Dean / Peter Norvig table) drive almost every design decision:

Operation                              | Rough time
L1 CPU cache reference                 | ~0.5 ns
Main memory (RAM) reference            | ~100 ns
Read 4 KB randomly from SSD            | ~150 µs (150,000 ns)
Round trip inside one datacenter       | ~0.5 ms (500 µs)
Hard-disk (HDD) seek                   | ~10 ms
Round trip California ↔ Netherlands    | ~150 ms

The mental model: by these numbers, RAM (~100 ns) is roughly 1,000× faster than a random SSD read (~150 µs), which is in turn roughly 1,000× faster than a cross-continent network round trip (~150 ms). This is why we cache in memory and why we avoid chatty cross-region calls.

Analogy: Scale these times up so a human can feel them. If reading from L1 cache (0.5 ns) took 1 second, then reading from RAM (100 ns) would take ~3.3 minutes, reading from SSD (150 µs) would take ~3.5 days, and a cross-continent network round trip (150 ms) would take ~9.5 years. That is why keeping data close (locality) and caching dominate design.
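The human-scale analogy is just one multiplication per row. A small sketch, using the rounded latencies from the table and an illustrative `humanize` helper:

```python
LATENCY_NS = {
    "L1 cache reference": 0.5,
    "RAM reference": 100,
    "SSD 4 KB random read": 150_000,
    "Datacenter round trip": 500_000,
    "HDD seek": 10_000_000,
    "California <-> Netherlands round trip": 150_000_000,
}

# Stretch time so that one L1 reference (0.5 ns) lasts one human second.
SCALE = 1.0 / LATENCY_NS["L1 cache reference"]


def humanize(seconds: float) -> str:
    """Express a duration in the largest unit that fits."""
    units = (("years", 365 * 86_400), ("days", 86_400), ("hours", 3_600), ("minutes", 60))
    for unit, size in units:
        if seconds >= size:
            return f"{seconds / size:.1f} {unit}"
    return f"{seconds:.1f} seconds"


for name, nanoseconds in LATENCY_NS.items():
    print(f"{name:40s} {humanize(nanoseconds * SCALE)}")
```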

The five universal trade-offs

This is the heart of the section. Almost every design decision is one of these tugs-of-war.

Trade-off | What pulls each way
Latency vs. throughput | Latency = time for one request. Throughput = requests handled per second. Batching many items together raises throughput but makes each item wait longer. Optimizing one can hurt the other.
Consistency vs. availability (CAP) | When the network "partitions" (machines can't reach each other), you must choose: refuse to serve possibly-wrong data (CP), or stay up and reconcile later (AP). Modern systems are tunable per operation. PACELC adds: even with no partition, you trade latency vs. consistency.
Cost vs. performance | More replicas, RAM, and regions buy speed and uptime, but cost money. Going from 99.9% to 99.99% uptime can cost roughly 10× the infrastructure.
Simplicity vs. flexibility | A monolith with one Postgres is easy to reason about. Microservices with many data stores scale teams but add operational and consistency complexity. Do not distribute prematurely.
Read-heavy vs. write-heavy | Read-heavy → replicas, caches, denormalize, CDN. Write-heavy → sharding, write-optimized stores (LSM-tree databases), queue-and-batch.

Common mistake: Treating CAP as "pick CP or AP for the whole system." Real systems tune consistency per operation (payment = strong, review count = eventual), and PACELC reminds you that you trade latency vs. consistency even when nothing is broken.

Reliability: making it survive failure

Redundancy — no single point of failure

A "single point of failure" is one part whose death takes down everything. Avoid it: run multiple app servers behind a load balancer, keep database replicas with failover, and spread across availability zones or regions. But redundancy only helps if failover is automatic and tested. An untested standby gives false confidence — and remember the load balancer, the cache, and the database can each themselves be a single point of failure.

Graceful degradation

Under heavy stress, drop non-essential features instead of falling over completely. Serve a slightly stale cached page, hide recommendations, disable the review widget — but keep checkout working. A worse-but-working experience beats a blank error page.

Best practice: Design for the unhappy path too. Every data view needs a loading state, an empty state, and an error state — and the system needs a plan for "what happens when the payment gateway is down."
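Graceful degradation often comes down to a fallback chain around a non-essential call. The sketch below is a toy version: `fetch_live` is a hypothetical callable standing in for a recommendations service, and the stale cache is a plain dictionary.

```python
def recommendations_with_fallback(fetch_live, stale_cache: dict, user_id: str) -> list:
    """Prefer live data, fall back to a stale copy, and finally to an empty list."""
    try:
        fresh = fetch_live(user_id)
    except Exception:  # broad on purpose: any failure here degrades instead of propagating
        return stale_cache.get(user_id, [])
    stale_cache[user_id] = fresh  # keep the fallback copy reasonably recent
    return fresh


def broken_service(user_id: str) -> list:
    """Stand-in for a dependency that is currently down."""
    raise TimeoutError("recommendations service unavailable")


cache: dict = {}
print(recommendations_with_fallback(lambda uid: ["book", "lamp"], cache, "u1"))
print(recommendations_with_fallback(broken_service, cache, "u1"))  # stale copy
print(recommendations_with_fallback(broken_service, cache, "u2"))  # nothing cached
```

The page still renders in all three cases; only the quality of the optional widget changes. The same pattern should not wrap checkout itself, where a silent fallback would hide real failures.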

Idempotency keys and safe retries

An operation is idempotent if doing it twice has the same effect as doing it once. To make "charge the card" idempotent, the client generates an idempotency key (a unique UUIDv4) per logical request and sends it along. The server stores key → result. If a retry arrives with the same key, the server returns the stored result instead of charging again. (Stripe keeps these keys for ~24 hours.)
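A minimal in-memory sketch of the key → result idea follows. `PaymentService` is a hypothetical toy class. A production version would persist keys in a database with a unique constraint, expire them, and handle two requests with the same key arriving concurrently, none of which is shown here.

```python
import uuid


class PaymentService:
    """Toy in-memory service that remembers results by idempotency key."""

    def __init__(self) -> None:
        self._results: dict[str, dict] = {}  # idempotency key -> stored result
        self.charges_made = 0

    def charge(self, idempotency_key: str, amount_cents: int) -> dict:
        if idempotency_key in self._results:
            # Replay of a request already handled: return the stored result.
            return self._results[idempotency_key]
        self.charges_made += 1  # stands in for the real call to the card network
        result = {"charge_id": f"ch_{self.charges_made}", "amount_cents": amount_cents}
        self._results[idempotency_key] = result
        return result


service = PaymentService()
key = str(uuid.uuid4())             # generated once per logical request by the client
first = service.charge(key, 4_999)
retry = service.charge(key, 4_999)  # e.g. the client timed out and tried again
print(first == retry, service.charges_made)
```

The client must reuse the same key for every retry of one logical request and generate a new key for a genuinely new purchase.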

Retries should use exponential backoff with jitter. The formula:

delay = min(cap, base × 2^attempt) + random_jitter

Example: With base = 200 ms and attempts counted from 0: the first retry (attempt 0) waits ~200 ms, the second ~400 ms, the third ~800 ms, backing off so the struggling server gets breathing room. Jitter is a small random amount added to each delay. Without it, thousands of clients that failed at the same instant would all retry at the exact same moment and crash the recovering server again: the thundering herd problem. Jitter spreads the retries out. Cap the number of attempts (Stripe ≈ 3; others 6–8).

Common mistake: Retrying without an idempotency key (→ duplicate charges/orders), or retrying client errors that can never succeed. Only retry transient failures (timeouts, 408 "request timeout," 429 "too many requests," 503 "service unavailable"). A 400 or 401 will never succeed on retry; retrying just wastes effort.
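The backoff formula and the retry rules can be combined into one small helper. This is a sketch under stated assumptions: attempts are counted from 0 so the delays come out as roughly 200, 400, and 800 ms, `send` is a hypothetical callable that returns an HTTP-like status code, and the cap and jitter sizes are illustrative values.

```python
import random
import time

RETRYABLE_STATUS = {408, 429, 503}  # transient failures worth retrying


def backoff_delay(attempt: int, base: float = 0.2, cap: float = 10.0,
                  max_jitter: float = 0.1) -> float:
    """delay = min(cap, base * 2**attempt) + random jitter, in seconds."""
    return min(cap, base * 2 ** attempt) + random.uniform(0, max_jitter)


def call_with_retries(send, max_attempts: int = 3, sleep=time.sleep):
    """Call send() until it returns a non-retryable status or attempts run out."""
    for attempt in range(max_attempts):
        status = send()
        if status not in RETRYABLE_STATUS:
            return status  # success, or an error that retrying cannot fix
        if attempt < max_attempts - 1:
            sleep(backoff_delay(attempt))
    return status  # still failing after the last allowed attempt


if __name__ == "__main__":
    responses = iter([503, 429, 200])  # two transient failures, then success
    print(call_with_retries(lambda: next(responses)))
```

Passing `sleep` as a parameter is a small design choice that lets tests record the delays instead of actually waiting. A real payment client would also send the same idempotency key on every attempt.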

Observability: the three pillars

Observability means being able to understand what your system is doing from the outside. It rests on three pillars:

  • Metrics = numeric time-series (QPS, error rate, p99 latency, CPU). Cheap; perfect for dashboards and alerts. They tell you what is wrong.
  • Logs = timestamped records of events, with rich detail. They tell you why it is wrong.
  • Traces = the journey of one request across all the services it touches, with timing at each hop. They tell you where it is slow or failing.

They work together: metrics alert, traces localize, logs explain. OpenTelemetry (OTel) is the vendor-neutral standard for emitting all three.

Example (a triage walkthrough): A p99-latency metric alert fires. You open a trace of a slow request and see the slow hop is the payment service. You read that service's logs and find "connection pool exhausted." Three pillars, one root cause.

Common mistake: Looking at average latency. The average hides the tail. Watch percentiles instead. "p99 = 800 ms" means 99% of requests are faster than 800 ms and the slowest 1% are worse, and that slow 1% is real users having a bad time.
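A tiny experiment shows how the average hides the tail. The latencies below are made up for illustration, and `percentile` is a simple nearest-rank implementation rather than the interpolating method a metrics system might use.

```python
import math
import statistics

# Hypothetical sample: 98 fast requests and 2 very slow ones, in milliseconds.
latencies_ms = [50] * 98 + [3_000] * 2


def percentile(values: list, pct: float):
    """Nearest-rank percentile: the smallest value with at least pct% of samples at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(len(ordered) * pct / 100))
    return ordered[rank - 1]


print("mean:", statistics.mean(latencies_ms), "ms")
print("p50: ", percentile(latencies_ms, 50), "ms")
print("p99: ", percentile(latencies_ms, 99), "ms")
```

The mean lands a little above 100 ms and looks healthy, while p99 exposes the multi-second requests that some users actually experienced.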

SLI, SLO, SLA, and error budgets

  • SLI (Service Level Indicator) = the measured number, e.g. "% of requests served under 200 ms."
  • SLO (Service Level Objective) = your internal target for that SLI, e.g. 99.9%.
  • SLA (Service Level Agreement) = a contractual promise to customers, with penalties (refunds) if you miss it. It is usually set looser than the SLO, so you have a safety margin.
  • Error budget = 100% − SLO = how much failure you are allowed. A 99.9% monthly SLO gives ~43 minutes of budget. Spend it on shipping faster and taking risks; when it is used up, freeze risky changes.

Availability            | Downtime per year | Per month
99%                     | ~3.65 days        | ~7.2 h
99.9% ("three nines")   | ~8.76 h           | ~43 min
99.99% ("four nines")   | ~52 min           | ~4.3 min
99.999% ("five nines")  | ~5 min            | ~26 s

Each extra nine roughly costs 10× more to achieve. Pick the level your business actually needs — most products do not need five nines.
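The downtime figures follow directly from "error budget = 100% − SLO". A short sketch that recomputes them, assuming a 365-day year and a 30-day month (the function name is illustrative):

```python
MINUTES_PER_YEAR = 365 * 24 * 60
MINUTES_PER_MONTH = 30 * 24 * 60  # assumes a 30-day month


def error_budget_minutes(slo_percent: float, period_minutes: float) -> float:
    """Allowed downtime in minutes: (100% - SLO) of the period."""
    return (100 - slo_percent) / 100 * period_minutes


for slo in (99.0, 99.9, 99.99, 99.999):
    per_year = error_budget_minutes(slo, MINUTES_PER_YEAR)
    per_month = error_budget_minutes(slo, MINUTES_PER_MONTH)
    print(f"{slo}%: {per_year:,.1f} min/year, {per_month:,.2f} min/month")
```

Each extra nine divides the budget by ten, which gives a feel for why each step gets harder to achieve.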

How to reason about any system you meet

This is the takeaway skill. When you face a system you have never seen, ask these questions in order:

  1. What does it do?
  2. What is the scale (QPS, storage)?
  3. What is the read:write shape?
  4. Where is the bottleneck?
  5. What does it cache, and what is the invalidation (staleness) story?
  6. What is the consistency requirement for each piece?
  7. What happens when X fails?
  8. How would I know it failed (metrics/alerts)?

If you can answer those eight questions, you can analyze any system. There is rarely one "right" design — only trade-offs made explicit and matched to requirements.

Key takeaways:
  • Use the repeatable process every time: requirements → estimate → API/data model → sketch → find the bottleneck → harden for reliability. State assumptions out loud.
  • The five universal trade-offs (latency/throughput, consistency/availability, cost/performance, simplicity/flexibility, read-heavy/write-heavy) underlie nearly every decision — and consistency is tunable per operation, not for the whole system.
  • Memory ≈ 1,000× faster than SSD ≈ 1,000× faster than a cross-continent round trip; this is why caching and locality dominate design.
  • Reliability = redundancy with tested automatic failover, graceful degradation, client-generated idempotency keys, and retries with exponential backoff plus jitter (jitter prevents the thundering herd).
  • Observe with the three pillars (metrics alert, traces localize, logs explain) and measure percentiles like p99, never averages. SLI is measured, SLO is your target, SLA is the contract; error budget = 100% − SLO.
  • To analyze any system, ask: purpose, scale, read:write shape, bottleneck, caching+invalidation, per-component consistency, failure modes, observability.
