Distributed Systems — Many Computers Working as One
A distributed system is a group of separate computers that work together over a network (the wires and wireless links that carry data between machines) so that, to the person using it, they look and behave like a single system. Each computer is called a node. When you open Netflix, search Google, or check your bank balance, you are talking to thousands of nodes pretending to be one.
The computer scientist Leslie Lamport summed up the pain perfectly: "A distributed system is one in which the failure of a computer you didn't even know existed can render your own computer unusable." That sentence is the whole chapter in disguise. Let's unpack why we build these systems anyway, and why they are so hard.
Why build a distributed system at all?
Running on one machine is simple. We give up that simplicity for three concrete reasons.
- Scale — one machine has a fixed amount of CPU (its "thinking power"), RAM (its short-term memory), and disk (its storage). When your traffic or data grows past the biggest single machine money can buy, you spread the work across many machines. This is called horizontal scaling: add more boxes instead of buying one giant box.
- Reliability / fault tolerance — one machine is a single point of failure: when it dies, everything dies. With copies of your service on many machines, the system survives one machine dying. Oddly, more machines means more total failures happen — but the system keeps running through them.
- Geography / latency — latency means delay. A user in Tokyo and a user in London both want fast replies. Data cannot travel faster than light, so a round trip around the world always takes hundreds of milliseconds. Putting servers near each region cuts that delay.
The one new problem that changes everything: partial failure
On a single machine, things either work or the whole thing crashes. Failure is total and obvious. In a distributed system the defining new problem is partial failure: some nodes work, some don't — and you cannot tell which.
Here is the heart of the difficulty. You send a request to another node and hear nothing back. There are four possible reasons, and they all look identical from where you sit:
- Your request never arrived (it got lost on the network).
- It arrived and was processed, but the reply got lost coming back.
- The other node is just slow and is still working on it.
- The other node is dead.
Network partitions and timeouts
A network partition is when the network splits so that two groups of healthy nodes can no longer talk to each other. Each group is fine; they just can't reach the other side. Each side may wrongly believe the other side has died.
Group A (healthy) Group B (healthy)
+-----------+ broken +-----------+
| A1 A2 |----- X ------| B1 B2 |
+-----------+ link +-----------+
| |
"B must be dead" "A must be dead"
(both are wrong - the LINK died, not the nodes)
Your only practical tool for detecting failure is a timeout: "if I don't hear back in X seconds, assume it's dead." But a timeout is only a guess:
- Too short → you wrongly declare healthy-but-slow nodes dead (a false positive) and retry needlessly, sometimes flooding the system.
- Too long → you hang for ages waiting on nodes that are truly dead.
There is no perfect timeout. In fact, deciding "is that node down?" in an asynchronous network (one with no guaranteed delivery time) is provably impossible to do perfectly. A famous 1985 result, the FLP impossibility (named after Fischer, Lynch, and Paterson), proves that no algorithm can guarantee all nodes reach agreement in an asynchronous network if even one node may fail. Real systems get around this with timeouts and a bit of luck — they work in practice, just not with a mathematical guarantee.
The 8 Fallacies of Distributed Computing
These are false beliefs developers fall into when they treat the network like it's local and perfect. Peter Deutsch (at Sun Microsystems, around 1994) wrote the first seven; James Gosling added the eighth.
| # | Fallacy | Reality |
|---|---|---|
| 1 | The network is reliable | Packets drop, cables get cut. |
| 2 | Latency is zero | Remote calls are far slower than local ones. |
| 3 | Bandwidth is infinite | Big payloads choke the link. |
| 4 | The network is secure | Assume it's hostile; encrypt and authenticate. |
| 5 | Topology doesn't change | Nodes, routes, and IPs change constantly. |
| 6 | There is one administrator | Many teams and configs; coordination is hard. |
| 7 | Transport cost is zero | Moving data costs money and CPU. |
| 8 | The network is homogeneous | Mixed hardware, protocols, and versions are normal. |
CAP theorem in plain words
Eric Brewer proposed the CAP theorem in 2000; Gilbert and Lynch proved it in 2002. It names three properties:
- C — Consistency (specifically linearizability): every read sees the most recent write. All nodes agree on one current value, as if there were a single copy.
- A — Availability: every request to a node that hasn't failed gets a real (non-error) answer, even if that answer might be slightly out of date.
- P — Partition tolerance: the system keeps working even when messages between nodes are dropped or delayed.
People say "pick 2 of 3," but here's the honest framing: in any real network, partitions will happen. So P is not optional. You don't get to "pick CA." The real choice is what to do during a partition:
- CP — refuse to answer rather than risk giving wrong data.
- AP — keep answering with possibly-stale data rather than go down.
PACELC — the fuller picture
CAP only describes a partition and ignores the everyday cost of consistency. Daniel Abadi's PACELC (2010) fills the gap. Read it as: if there's a Partition, choose Availability or Consistency; Else (normal operation), choose Latency or Consistency.
The insight: even on a perfectly healthy network, keeping copies strongly consistent requires extra coordination round-trips, and those add delay. Strong consistency always costs latency.
| Class | Meaning | Examples |
|---|---|---|
| PA/EL | Stay available in a partition; stay fast normally; weaker consistency | Cassandra, Riak |
| PC/EC | Strong consistency always, paying latency | Google Spanner, CockroachDB, classic ACID SQL |
| PA/EC | Available under partition; consistent under normal ops | MongoDB (default) |
Consistency models — a spectrum, not a yes/no
Consistency isn't all-or-nothing. It runs from strongest (easiest to reason about, most expensive) to weakest (cheapest and fastest, hardest to reason about).
- Strong / linearizable — behaves as if there's one single copy; a read always returns the latest committed write. Simplest to think about, highest coordination cost.
- Causal — operations with a cause-and-effect link (like a reply to a comment) appear in that order everywhere. Unrelated operations may be seen in different orders on different nodes. A nice middle ground: it keeps things that "make sense together" in order without forcing global agreement.
- Eventual — if writes stop, all copies eventually settle on the same value. Until then, reads may be stale or out of order. Cheapest and most available.
Session guarantees make eventual consistency feel sane for one user:
- Read-your-writes — after you change something, you always see your own change (you post a comment and immediately see it).
- Monotonic reads — once you've seen a value, you never see an older one (no "going back in time").
- Monotonic writes — your writes are applied in the order you made them.
Replication, quorums, and consensus
Replication means keeping copies of data on several nodes — for durability (you don't lose it if one disk dies), availability, and faster reads. The hard part is keeping the copies agreeing while nodes and networks fail.
A quorum is a voting trick. With N copies, require W nodes to confirm a write and R nodes to serve a read. If R + W > N, the read set and write set must overlap by at least one node — so a read is guaranteed to see the latest write.
N = 3 replicas W = 2 (write reaches 2) R = 2 (read asks 2)
Since 2 + 2 = 4 > 3, the sets MUST share a node.
Write went to: (n1, n2)
Read asks: (n2, n3)
^^ overlap => read sees the fresh value
A majority quorum means more than half: ⌊N/2⌋+1. It tolerates a minority failing and prevents two separate groups from both committing — which would cause split-brain (two halves each thinking they're in charge, accepting conflicting writes).
Consensus is the act of making many nodes agree on a single ordered sequence of values — a replicated log (an append-only list of operations, in the same order on every node) — despite crashes.
- Paxos (Leslie Lamport) — correct and foundational, but famously hard to understand and build.
- Raft (Ongaro and Ousterhout, 2014) — designed to be understandable. At any time there's one leader; nodes are leader, follower, or candidate. Leader election uses randomized timeouts: whoever times out first becomes a candidate, asks for votes, and wins with a majority. Log replication: clients send to the leader, which appends the entry and forwards it to followers; once a majority confirms, the entry is committed. If the leader dies, a new election runs under a new term (a numbered period of leadership).
Follower --(no heartbeat, times out)--> Candidate Candidate --(asks for votes, gets a majority)--> Leader Leader --(sends heartbeats)--> Followers Leader crashes --> a Follower times out --> new election (term+1)
Raft powers etcd, Consul, CockroachDB, and TiKV; Paxos-family algorithms underpin Google's Chubby/Spanner and ZooKeeper. Consensus needs a majority alive, so it tolerates ⌊(N−1)/2⌋ failures: 3 nodes tolerate 1, 5 nodes tolerate 2.
Retries, idempotency, and delivery guarantees
Because the network is unreliable, clients retry. But a retry can duplicate work: maybe your first request did succeed and only the reply got lost. The fix is idempotency: doing the operation twice has the same effect as doing it once.
SET balance = 100 run twice still leaves 100 — safe to retry. But ADD 100 to balance run twice adds 200 — a lost-reply retry just double-charged the customer. The cure is an idempotency key: the client attaches a unique ID (say order-abc123); the server records which IDs it has processed and ignores duplicates.Delivery guarantees describe how hard a system tries to deliver a message:
- At-most-once — never retry; may lose messages. Simple but lossy.
- At-least-once — retry until confirmed; may deliver duplicates. The common, practical default.
- Exactly-once delivery over an unreliable channel is impossible.
What real systems actually achieve is exactly-once processing = at-least-once delivery + idempotent processing on the receiving end. When a product says "exactly-once," this combination is what it means — never true network-level exactly-once.
Clocks and ordering — why wall clocks lie
It's tempting to order events across machines by their timestamps. Don't. Each machine's wall clock (its real-time-of-day clock) drifts. NTP (the protocol that syncs clocks over the internet) can even jump a clock backward to correct it. There is no single global "now." Comparing two timestamps from two servers can give the wrong order and silently lose data.
Instead we use logical clocks that order by cause-and-effect, not by time:
- Lamport timestamps — each node keeps a counter, bumps it on every event, and on receiving a message sets
counter = max(local, received) + 1. Guarantee: if A caused B, then A's number is smaller. But the reverse isn't true — a smaller number does not prove a causal link. Lamport clocks give a total order but lose information about which events were truly simultaneous. - Vector clocks — each node tracks a whole list of counters, one per node. By comparing two vectors you can tell whether two events are causally ordered or genuinely concurrent (neither caused the other). This is how systems detect conflicting concurrent writes that must be reconciled. The cost: the vector grows with the number of nodes.
[A:2, B:1, C:0] versus [A:1, B:2, C:0] reveal that neither came before the other — they're concurrent and conflict, so a human or a merge rule must resolve them.The rare exception is Google Spanner's TrueTime, which tames physical clocks using GPS and atomic clocks, then deliberately waits out a small uncertainty window to give globally consistent timestamps.
Distributed transactions: 2PC versus Sagas
A transaction is a group of operations that must all succeed or all fail together (its ACID guarantee). Inside one database this is easy. Spanning several services or databases — a distributed transaction — is hard.
Two-Phase Commit (2PC) uses a coordinator:
- Phase 1 (prepare/vote) — the coordinator asks every participant "can you commit?" Each locks its resources and replies yes or no.
- Phase 2 (commit/abort) — if all said yes, tell everyone to commit; otherwise tell everyone to abort.
2PC gives atomicity but is blocking: participants hold locks while waiting, and if the coordinator crashes after the vote, participants can be stuck holding those locks indefinitely. The coordinator is a single point of failure. This is a poor fit for microservices because it couples services together and kills availability and throughput.
The Saga pattern takes a different path. Instead of one global transaction, run a sequence of local transactions, each committing on its own in its own service. If a later step fails, run compensating transactions that semantically undo the earlier ones — rather than rolling everything back at once.
Sagas come in two flavors: choreography (each service reacts to the others' events, no central brain) and orchestration (one central coordinator drives the steps). The trade-off versus 2PC: no global locks, far more scalable and available — but you give up isolation. Other operations can briefly see half-finished state, and every step needs a real undo action.
- A distributed system makes many computers act as one — for scale, fault tolerance, and low latency — but the network is unreliable and you can't tell "slow" from "dead." That ambiguity (partial failure) is the root of every hard problem.
- Partitions are unavoidable, so CAP's real choice is C-vs-A during a partition (CP refuses, AP stays up); PACELC adds that even normally, strong consistency costs latency.
- Consistency is a spectrum (strong → causal → eventual); session guarantees like read-your-writes make eventual consistency feel right to a single user.
- Quorums (R + W > N) and majority consensus (Raft, Paxos) keep replicas agreeing and prevent split-brain; use odd node counts so N=3 tolerates 1 failure, N=5 tolerates 2.
- Networks force retries, so make operations idempotent (idempotency keys); true exactly-once delivery is impossible — settle for at-least-once + idempotent processing.
- Don't order events by wall clocks; use logical/vector clocks for causality. For cross-service transactions, prefer Sagas with compensations over blocking 2PC.