System Design — Complete Learning Curriculum

By Pritesh Yadav June 15, 2026 11 min read —

This is a self-study course, not a stack of cheat-sheets. The goal is to take an experienced web developer — someone comfortable shipping a CRUD app against one database — all the way to staff-level reasoning about distributed systems: why linearizability is expensive, why every cross-service “transaction” is a lie you have to engineer around, why the speed of light shows up in your p99, and how to make defensible trade-offs under uncertainty.

Each of the 20 modules is teaching material: it builds intuition first (plain analogies), then precise mechanics, then the failure modes that show up in a 3 a.m. incident or a hard interview question. Modules include back-of-the-envelope math, ASCII diagrams, comparison tables, war stories, Test-Yourself questions, and pointers to the canonical papers. They are meant to be read in order and worked through, not skimmed.

How to use this curriculum

Read in tier order. Later modules assume the vocabulary of earlier ones. The Distributed Systems Core (Tier 3) is the hard, load-bearing middle — don’t skip to the interview module before it.
Do every Test-Yourself question before moving on. If you can’t answer in your own words, re-read — recognition is not understanding.
Build small versions. A toy consistent-hash ring (300 lines), a token-bucket limiter in Redis, a tiny event-sourced ledger that rebuilds state by folding events. You learn replication lag by causing it, not by reading about it.
Read the cited papers. The summaries here compress 30-page papers into paragraphs. Read Dynamo, Raft, and the Spanner paper in full at least once — they’re more readable than their reputation.
Re-read after a real incident. The modules land much harder once you’ve watched a retry storm take down your own service.

Budget roughly one focused evening per module for the fundamentals, two for the Tier 3 distributed-systems modules. The whole curriculum is a multi-week effort — that’s expected.

Suggested learning path

The 20 modules group into five tiers. Within a tier, modules listed as “parallel-OK” can be read in any order; across tiers, finish the earlier tier first.

Tier 1 — Fundamentals (everyone, first)

01 → 02 → 03 → 06 Estimation, networking, API design, and caching. This is the shared vocabulary and the mental math (latency numbers, QPS, bytes) the whole rest of the course leans on. Module 01 is the absolute starting point.

Tier 2 — Data & State

04 → 05 → 07 → 08 How a single database actually works (storage engines, indexes, transactions), how to model data across SQL/NoSQL, then how to scale the serving tier (load balancing, stateless design) and the data tier (replication, partitioning). 08 is the bridge into distributed systems.

Tier 3 — Distributed Systems Core (the advanced heart)

09 → 10 → 11 → 12 → 13 → 14 → 15 The conceptual center of gravity. CAP/PACELC and consistency models (09) and consensus (10) are prerequisites for almost everything after. Distributed transactions/sagas (11), messaging (12), event sourcing + CQRS (13), and stream processing (14) build directly on them. Probabilistic structures (15) is more standalone and can be read in parallel with the others once you have 09.

Tier 4 — Architecture & Operating at Scale

16 → 17 → 18 → 19 Resiliency/rate-limiting (16) and observability/SRE (17) — parallel-OK with each other. Then architecture patterns (18, monolith→microservices→cells) and specialized stores (19, search/geo/time-series/OLAP), which assume Tiers 2 and 3.

Tier 5 — Applied / Interview

20 The 7-step interview framework + ten worked case studies. Read it last — it’s where every prior concept gets exercised under a time budget. (You can skim it first to see where the course is going, then return to it for real.)

 TIER 1 Fundamentals
 ┌──────────────────────────────────────────────┐
 │ 01 Estimation ─► 02 Networking ─► 03 APIs     │
 │        └──────────────────────► 06 Caching    │
 └───────────────────────┬──────────────────────┘
                         ▼
 TIER 2 Data & State
 ┌──────────────────────────────────────────────┐
 │ 04 DB Internals ─► 05 Data Modeling           │
 │ 07 Load Bal/Scaling ─► 08 Replication+Shard ──┼──┐  (08 = bridge)
 └──────────────────────────────────────────────┘  │
                                                    ▼
 TIER 3 Distributed Systems Core  (the hard middle)
 ┌──────────────────────────────────────────────┐
 │ 09 CAP/PACELC ─► 10 Consensus                 │
 │      │                 │                       │
 │      ▼                 ▼                       │
 │ 11 Dist Txns/Sagas   12 Messaging ─► 14 Stream│
 │                         └─► 13 EventSrc/CQRS   │
 │ 15 Probabilistic structures (parallel, needs 09)│
 └───────────────────────┬──────────────────────┘
                         ▼
 TIER 4 Architecture & Operating
 ┌──────────────────────────────────────────────┐
 │ 16 Resiliency/RateLimit ║ 17 Observability/SRE │
 │ 18 Arch Patterns (mono→micro→cells)            │
 │ 19 Specialized (search/geo/tsdb/OLAP)          │
 └───────────────────────┬──────────────────────┘
                         ▼
 TIER 5 Applied
 ┌──────────────────────────────────────────────┐
 │ 20 Case Studies & Interview Framework          │
 └──────────────────────────────────────────────┘

#	Module	What you’ll learn	Difficulty
01	Foundations & Back-of-the-Envelope Estimation	Latency/throughput/storage numbers every engineer should know; sizing a system from user counts; the napkin math that anchors every design	★☆☆☆☆
02	Networking & Protocols	TCP/UDP, TLS, HTTP/1.1→2→3, DNS, the OSI mental model, and where the speed of light bites your latency budget	★★☆☆☆
03	API Design & Service Communication	REST vs gRPC vs GraphQL, versioning, pagination, idempotency, sync vs async communication patterns	★★☆☆☆
04	Database Internals: Storage Engines, Indexes, Transactions	B-trees vs LSM-trees, the WAL, MVCC, isolation levels, how an index actually speeds a query	★★★☆☆
05	Data Modeling: SQL vs NoSQL & Polyglot Persistence	Normalization vs denormalization, document/wide-column/KV/graph models, picking the right store per access pattern	★★★☆☆
06	Caching Deep Dive	Cache patterns (aside/through/behind), eviction, invalidation, TTLs, stampedes, and the layers from CPU to CDN	★★★☆☆
07	Load Balancing, Scaling & Stateless Design	L4 vs L7, balancing algorithms, health checks, externalized state, autoscaling, read-replicas vs sharding, load shedding	★★★☆☆
08	Replication & Partitioning (Sharding)	Single/multi/leaderless replication, sync vs async durability, lag anomalies, consistent hashing, rebalancing, routing	★★★★☆
09	CAP, PACELC & Consistency Models	CAP stated correctly, PACELC, the linearizable→eventual spectrum, quorum math, isolation vs consistency	★★★★☆
10	Consensus & Distributed Coordination	FLP impossibility, Paxos/Raft/ZAB, ZooKeeper/etcd, distributed locks + fencing tokens, leases, clocks	★★★★★
11	Distributed Transactions, Sagas & Idempotency	Why cross-service ACID fails, 2PC/3PC blocking, sagas + compensations, the outbox+CDC fix, delivery semantics	★★★★★
12	Messaging, Queues & Event-Driven Architecture	Queues vs logs, Kafka internals, ordering/delivery guarantees, backpressure, retries, DLQs, exactly-once limits	★★★★☆
13	Event Sourcing & CQRS	State as an append-only event log, folding + snapshots, read projections, versioning/upcasting, when it’s overkill	★★★★☆
14	Stream Processing & Real-Time Systems	Batch vs stream, Lambda vs Kappa, event vs processing time, windowing, watermarks, checkpointing, exactly-once	★★★★★
15	Probabilistic Data Structures & Algorithms at Scale	Bloom/Cuckoo filters, HyperLogLog, Count-Min, t-digest, Merkle trees, consistent hashing, geo indexes, skip lists	★★★★☆
16	Rate Limiting, Resiliency & Fault Tolerance	The five rate-limiters, distributed Redis variants, timeouts, jittered retries, circuit breakers, bulkheads, chaos	★★★★☆
17	Observability, SRE & Operating Systems at Scale	Metrics/logs/traces, RED/USE, tracing + sampling, SLI/SLO/error budgets, burn-rate alerts, canary/blue-green deploys	★★★☆☆
18	Architecture Patterns: Monolith → Microservices → Cells	When (not) to split, DDD bounded contexts, strangler-fig, gateway/mesh, DB-per-service, distributed-monolith trap, cells	★★★★☆
19	Specialized Systems: Search, Geo, Time-Series & Analytics	Inverted indexes + BM25, geohash/quadtree/S2, TSDB compression, columnar OLAP, polyglot persistence via CDC	★★★★☆
20	Case Studies & the System Design Interview Framework	A 7-step design loop + ten worked designs (URL shortener, feed, chat, rideshare, payments…) and their load-bearing decisions	★★★☆☆

Mental models / first principles

These big ideas recur across nearly every module. If you internalize only ten things, make it these:

There is no free lunch — every choice is a trade-off. Latency vs consistency, space vs accuracy, throughput vs durability. “Best” is meaningless without a workload.
You can’t beat the speed of light. Cross-region round trips cost tens of milliseconds no matter how fast your code is. Geography sets the floor on your p99.
Failure is the normal case, not the exception. At scale, something is always broken. Design for partial failure, not for the happy path.
State is the enemy of scaling. Stateless services scale trivially; the hard problems all live where state must be shared, replicated, or agreed upon.
Consistency, availability, and partitions force a choice (CAP/PACELC) — and even without a partition you still trade latency against consistency on every request.
Coordination is expensive; avoid it when you can. Consensus, distributed locks, and linearizable reads cost round trips. The cheapest distributed system is one that needs no agreement.
The dual-write problem is everywhere. You cannot atomically update two systems. The answer is almost always one atomic write plus the outbox/log/CDC pattern.
Idempotency is the antidote to “exactly-once.” True exactly-once delivery doesn’t exist across a network; at-least-once + idempotent handlers is how production gets it.
Indexes (and data layouts) are shaped like queries. The right store/index is the one whose structure mirrors your read pattern — that’s why polyglot persistence exists.
Bound everything: timeouts, retries, queues, blast radius. Unbounded anything becomes a cascading or metastable failure. Backpressure and isolation keep failures local.
Eventual consistency is a UX problem as much as a data problem. Read-your-writes and monotonic reads are often what users actually need — not full linearizability.
Don’t over-engineer. A modular monolith on one Postgres handles more than most teams ever reach. Distribute only when a concrete forcing function demands it.

Canonical resources

The course stands on these. Read the book; read the papers at least once each.

Books

Designing Data-Intensive Applications (Kleppmann) — the companion to Tiers 2–3. If you read one book, read DDIA.
Google SRE Book (+ The SRE Workbook) — backs Module 17; SLOs, error budgets, on-call.
The System Design Primer (GitHub, donnemartin) — breadth review + interview drilling.

Foundational papers

Dynamo (Amazon, 2007) — leaderless replication, quorums, consistent hashing, vector clocks (Modules 08, 09, 15).
Spanner / TrueTime (Google, 2012) — external consistency via bounded-uncertainty clocks (Modules 09, 10).
Raft — In Search of an Understandable Consensus Algorithm (Ongaro & Ousterhout, 2014) (Module 10).
MapReduce (Google, 2004) and the Kafka paper (LinkedIn, 2011) — batch + the log (Modules 12, 14).
Bigtable (Google, 2006) — wide-column/LSM design (Modules 04, 05).
Also worth it: the Paxos Made Simple note, the CAP retrospective (Brewer, 2012), the FLP impossibility result, and the HyperLogLog paper.

Courses

MIT 6.824 Distributed Systems — lectures + labs (build Raft yourself). The single best practical complement to Tier 3.

Self-assessment milestones

You’ve internalized the curriculum when you can honestly check these off:

Welcome in. Start with Module 01 and work forward.

Continue reading

🏗️ System Design

Foundations & Back-of-the-Envelope Estimation

🏗️ System Design

Networking & Protocols

🏗️ System Design

API Design & Service Communication

How to use this curriculum

Suggested learning path

Tier 1 — Fundamentals (everyone, first)

Tier 2 — Data & State

Tier 3 — Distributed Systems Core (the advanced heart)

Tier 4 — Architecture & Operating at Scale

Tier 5 — Applied / Interview

Table of contents

Mental models / first principles

Canonical resources

Self-assessment milestones

Continue reading