Why Distributed Systems Are Hard
In the first section you learned what a distributed system is: a group of separate computers that work together over a network but appear, to the user, as one system. This section answers a harder question: why is building one so difficult? If you have only ever written ordinary programs that run on a single machine, the rules you have relied on your whole career quietly stop being true. The goal of this section is to show you exactly which rules break, why they break, and what that forces you to do differently.
Let us define a few words up front, in plain English, because we will use them constantly.
- Node — one computer (or one running process) that is part of the system. A node could be a server, a database replica, or a microservice.
- Network — the wires, routers, and software that carry messages between nodes. Think of it as the postal service between computers.
- Message — a chunk of data one node sends to another (a request, a reply, a heartbeat).
- Process — a running program. On one machine you may have many.
- State — the data a node currently holds in memory or on disk (account balances, user sessions, counters).
On a single machine, all your code shares the same memory, the same clock, and the same fate: if the program crashes, everything stops together. A distributed system throws away all three of those comforts. That is the root of every difficulty below.
2.1 Partial failure: the defining problem
What it is. Partial failure means that some parts of the system fail while other parts keep running. One server out of ten dies, or one network link goes down, but the rest of the system carries on, unaware.
Why it exists. On a single computer, failure is total. If the power dies, your whole program dies; there is no awkward in-between state where half your variables are alive and half are dead. But in a distributed system the parts are physically separate. They have separate power supplies, separate memory, separate operating systems. So nothing forces them to fail together. The normal case is that one piece breaks and the others do not even notice yet.
How it works, step by step. Imagine node A asks node B to "save this order."
Node A Node B
| |
|---- "save order #42" -------->| (1) request travels over network
| |
| [B saves it] (2) B does the work
| |
| X <----- "done!" ----------| (3) reply is LOST on the way back
| |
(A waits... and waits...) | (4) A never hears back
Look at A's situation in step 4. A sent a request and heard nothing. A cannot tell which of these happened:
- The request never reached B (so the order was not saved).
- B saved the order but the reply was lost (so the order was saved).
- B is just slow and the reply is still coming (so the order will be saved).
- B crashed completely (so… maybe saved, maybe not, depending on timing).
This is the cruel heart of distributed systems: from the outside, a node that has crashed looks identical to a node that is merely slow. Silence tells you nothing. There is no signal that says "I am definitely dead." A dead node and a node stuck behind a slow network both produce the exact same thing: no answer.
Why it matters. In single-machine code you almost never write "what if assigning this variable only half-worked?" — it cannot half-work. In distributed code, every single remote interaction can half-work, and you must have a plan for it. Ignoring partial failure does not make it go away; it just means your system will corrupt data or hang the first time a packet is lost in production.
2.2 The network is unreliable
What it is. The network — the thing that carries your messages — is not a reliable pipe. Messages can be lost (never arrive), delayed (arrive much later than expected), duplicated (arrive more than once), or reordered (arrive in a different order than you sent them).
Why it exists. A network is not a single wire. It is a long chain of cables, switches, and routers, often spanning buildings, cities, or continents. Each hop has buffers that can overflow and drop packets. Routers reboot. Cables get cut. Software retransmits, sometimes producing a duplicate. There is no physical law guaranteeing that what you send arrives, arrives once, or arrives in order.
Here are the four failure modes drawn out:
SENT (in order): [M1] [M2] [M3] [M4] LOST: [M1] [--] [M3] [M4] M2 vanished DELAYED: [M1] [M3] [M4] ......... [M2] M2 arrives very late DUPLICATED: [M1] [M2] [M2] [M3] M2 arrived twice REORDERED: [M1] [M3] [M2] [M4] M3 overtook M2
How it works, step by step. Say you send "withdraw $100" to a bank node. The network drops it. You see no response. You assume it failed and send it again. This time it arrives — but so does the first one, which was only delayed, not lost. The account is now debited $200. The unreliable network turned one intended action into two.
2.3 No global clock: there is no shared "now"
What it is. Every machine has its own clock, and no two clocks agree exactly. There is no single, shared, authoritative "now" that all nodes can read. Worse, clocks drift — they speed up or slow down slightly over time — so the disagreement changes constantly.
Why it exists. A clock is a physical oscillator (a tiny vibrating crystal). No two crystals vibrate at exactly the same rate; temperature and age change the rate. So each machine's clock slowly wanders away from the others. Systems try to fix this with NTP (Network Time Protocol — software that periodically asks a time server "what time is it?" and nudges the local clock toward it). But NTP itself runs over the unreliable, variable-delay network, so it can only get clocks close, not identical. Clocks can be off by milliseconds normally, and by seconds or more when something goes wrong.
Why it matters. A huge amount of ordinary programming secretly relies on a shared clock. "Which write happened last?" "Did this token expire?" "Order these events by timestamp." On one machine those questions are trivial — there is one clock. Across machines, comparing two timestamps from two different clocks is meaningless, because the clocks may disagree by more than the time gap you are trying to measure.
Real time: ----t1----t2----------------------> Node A clock: reads 10.000s at t1 (A's clock is 30ms FAST) Node B clock: reads 9.950s at t2 (B's clock is 20ms SLOW) Event X happens on A, timestamped 10.000 Event Y happens on B LATER, timestamped 9.950 Naive sort by timestamp says: Y before X. WRONG — Y came AFTER X.
2.4 Concurrency: many things happen at once
What it is. Concurrency means multiple things are happening at the same time, on different nodes, with no built-in agreement about their order. Two users, on two servers, can act on the same data in the same instant.
Why it exists. The whole point of a distributed system is that many nodes run in parallel — that is how you serve millions of users. But parallel means simultaneous, and simultaneous means there is no obvious "this happened before that." On one machine you can wrap shared data in a lock and force one-at-a-time access cheaply, because everything shares memory. Across machines there is no shared memory to lock; coordinating order now requires sending messages over the unreliable network, which is slow and can fail.
How it works, step by step. Two customers buy the last concert ticket at the same moment:
Time --> User A --> Server 1: "tickets left?" reads 1 User B --> Server 2: "tickets left?" reads 1 (both read at once!) User A --> Server 1: "buy 1" writes 0 User B --> Server 2: "buy 1" writes 0 RESULT: two tickets sold, one ticket existed. Oversold.
Each server, looking only at itself, behaved correctly. The bug only exists in the interleaving of the two — and that interleaving is invisible to either node. This is why distributed systems need explicit coordination tools (locks held in a service like etcd or ZooKeeper, or database transactions) to serialize access to shared resources.
2.5 The Two Generals Problem: you can never be 100% sure
What it is. The Two Generals Problem is a famous thought experiment that proves a startling fact: over an unreliable channel, two parties can never become completely certain they agree. It was the first computer-communication problem ever proven to be unsolvable.
The setup. Two generals, A and B, are camped on opposite hills with the enemy in the valley between them. They will win only if they attack at the same time. If just one attacks alone, that army is destroyed. They can communicate only by sending a messenger through the valley — and the messenger may be captured, so any message may be lost.
General A ))))) valley (enemy may capture messenger) ((((( General B
| |
|---- "Attack at dawn!" --------------------------------->| may be lost
| |
|<--- "Agreed — at dawn!" (ack) --------------------------| may be lost
| |
|---- "Got your ack!" (ack of the ack) ----------------->| may be lost
| ... and so on, forever ... |
Why it is unsolvable. Suppose A sends "Attack at dawn." A cannot act on it until A knows B received it — so B must send an acknowledgement (an ack, a "got it" reply). But that ack might be lost, so B cannot be sure A got the ack — so A must ack the ack. But that might be lost… Every message needs a confirmation, and every confirmation needs a confirmation. There is no last message you can fully trust, because the last message might be the one that gets lost. No finite number of messages ever produces certainty. The proof is by contradiction: take any protocol that supposedly works, delete its final message — if the protocol still works, that message was unnecessary, so remove it too; repeat until no messages remain, which obviously cannot coordinate an attack. Contradiction. So no such protocol exists.
Why it matters in practice. This is not just a puzzle. Every time your service calls another service, you are a general sending a messenger. When you get no reply, you are in exactly the general's dilemma: did they get it or not? You cannot achieve certainty. What you do instead is settle for being probably sure and designing so that the remaining doubt is harmless. The three tools below are how we live with that doubt.
2.6 The consequences: timeouts, retries, idempotency
The problems above force three coping tools into almost every distributed system. Learn the words now; you will see them everywhere.
Timeout
What it is. A timeout is a maximum time you are willing to wait for a reply before giving up and assuming something went wrong. "If B has not answered in 5 seconds, stop waiting."
Why we need it. Because of partial failure (§2.1), a node may never answer. Without a timeout, your code would wait forever, holding memory and connections, until it too falls over. The hard part: a timeout cannot tell you whether the work happened. It only tells you that you stopped waiting. A timed-out request may still have succeeded on the other side.
Retry
What it is. A retry is simply sending the request again after a timeout or error, in the hope it works this time.
Why we need it. The network loses messages (§2.2), and many failures are temporary (a brief blip, an overloaded server). Retrying recovers from those. The danger: remember the "withdraw $100 twice" story. If the first request actually succeeded but its reply was lost, your retry does the work a second time. Retries turn lost-reply situations into duplicate-action situations — which is why we need the next tool.
Idempotency
What it is. An operation is idempotent if doing it twice has the same effect as doing it once. "Set status to PAID" is idempotent — running it five times still just means PAID. "Add $100 to the balance" is not idempotent — running it five times adds $500.
Why we need it. Because retries (and duplicated network messages) mean the same request will arrive more than once. If your operations are idempotent, duplicates are harmless. The common technique is an idempotency key: the client attaches a unique ID to the request; the server remembers which IDs it has already processed and ignores repeats.
Idempotency-Key header. If your code times out and retries a charge, Stripe sees the repeated key and returns the original result instead of charging the customer twice. This single idea is what makes it safe to retry over an unreliable network. The same pattern protects message systems: Apache Kafka added an "idempotent producer" so that a producer retrying after a network hiccup does not write the same record twice.
Without idempotency: With idempotency key "k9": send "charge $50" --X (lost reply) send "charge $50" key=k9 --X (lost reply) timeout -> retry "charge $50" timeout -> retry key=k9 RESULT: charged $100 (BUG) server: "seen k9 already" -> $50 once (SAFE)
2.7 "A Note on Distributed Computing" (Waldo et al., 1994)
In November 1994, four engineers at Sun Microsystems — Jim Waldo, Geoff Wyant, Ann Wollrath, and Sam Kendall — published a now-famous paper called "A Note on Distributed Computing." It is one of the most important warnings in the field, and its message is simple:
A remote call is NOT just a local function call that happens to go over a wire. Pretending it is will eventually destroy your system.
The problem it attacks. In the early 1990s, a popular idea was transparency: make calling a function on another machine look exactly like calling a function in your own program, so developers do not have to think about the network. This was sold as a convenience. The paper argues it is a trap. (The technique of making a remote call look local is called RPC — Remote Procedure Call. It is genuinely useful, but the illusion of sameness is the danger.)
The four fundamental differences the paper says you can never hide:
| Dimension | Local call (same machine) | Remote call (over network) |
|---|---|---|
| Latency (how long it takes) | Nanoseconds. Effectively free. | Hundreds of microseconds to hundreds of milliseconds — thousands to millions of times slower. A loop that was instant locally can take minutes remotely. |
| Memory access | You share one address space; a pointer on one side means something on the other. | No shared memory. A pointer is meaningless across machines; data must be copied (serialized) and sent. References do not transfer. |
| Partial failure | Either the whole program runs or it crashes — together. | The other side can be dead, slow, or unreachable while you keep running, and you cannot tell which (see §2.1). |
| Concurrency | You can often reason about one thread at a time; locks are cheap. | Other parties act simultaneously and independently; you must design for it (see §2.4). |
Why pretending is dangerous. If your code thinks a remote call is local, it will (1) make far too many tiny remote calls and run unbearably slowly because latency is real; (2) assume the call either fully succeeds or throws — with no handling for "I have no idea if it worked"; (3) ignore that two clients can call at once. The paper notes these failures were long hidden because early distributed systems were tiny. Scale them up and the cracks become catastrophes. The lesson: build the network's reality into your interfaces on purpose — design APIs that are coarse-grained (few big calls, not many small ones), that return explicit failure outcomes, and that the programmer knows are remote.
2.8 The Eight Fallacies of Distributed Computing
The Eight Fallacies of Distributed Computing are a list of false assumptions that newcomers make. The first four are credited to Bill Joy and Dave Lyon at Sun; L. Peter Deutsch added three more around 1994; and James Gosling (creator of Java) added the eighth around 1997. They are called fallacies because each one feels obviously true and is, in fact, false — and each false belief carries a real cost.
| # | Fallacy | Why it is false | What it costs you if you believe it |
|---|---|---|---|
| 1 | The network is reliable | Cables get cut, switches reboot, packets drop. Failure is routine, not rare. | No error handling, no retries → requests hang forever, resources leak, the app stalls the first time a packet is lost. |
| 2 | Latency is zero | Light takes ~67ms to cross the Atlantic and back; real round-trips are far worse. Remote is not instant. | "Chatty" designs making hundreds of tiny remote calls become unusably slow. A page that loads fast in dev crawls in production. |
| 3 | Bandwidth is infinite | Links carry a finite amount per second; big payloads and many users saturate them. | Sending huge responses / unbounded data → bottlenecks, congestion, dropped packets, timeouts under load. |
| 4 | The network is secure | Anyone on the path can read, alter, or inject traffic; threats come from inside too. | No encryption/auth → eavesdropping, tampering, breaches. You get blindsided by attackers you assumed could not reach you. |
| 5 | Topology doesn't change | Servers are added/removed, IPs change, links reroute, cloud instances come and go constantly. | Hard-coded hosts/IPs and fixed routes break; you need service discovery and dynamic config, or things silently stop resolving. |
| 6 | There is one administrator | Different teams, vendors, and cloud providers each control a piece, with their own rules. | You cannot assume coordinated changes; conflicting policies, firewall rules, and upgrade schedules cause mystery outages. |
| 7 | Transport cost is zero | Moving bytes costs money (egress fees, bandwidth bills) and CPU (serializing/deserializing data). | Surprise cloud bills and CPU spent packing/unpacking data you ignored in your budget and performance plan. |
| 8 | The network is homogeneous | Real systems mix operating systems, languages, hardware, and protocols. | Assuming everyone speaks your exact format → incompatibility; you need standard, portable formats (e.g. JSON, Protobuf) and explicit contracts. |
Common mistakes & misconceptions
Best practices
Section summary
- Partial failure is the defining hardship: parts fail independently, and a crashed node is indistinguishable from a slow one — silence is ambiguous.
- The network is unreliable: messages can be lost, delayed, duplicated, and reordered. "I sent it, so it arrived once, in order" is always wrong eventually.
- There is no global clock: every machine's clock differs and drifts, so cross-node timestamps cannot be trusted for ordering or correctness.
- Concurrency is the default: many nodes act at once with no inherent order, so shared resources need deliberate coordination.
- The Two Generals Problem proves you can never reach 100% certainty of agreement over an unreliable channel — you must act under uncertainty and stay recoverable.
- These realities force three survival tools: timeouts (stop waiting on the dead), retries (recover from loss), and idempotency (make the resulting duplicates harmless).
- Waldo et al. (1994) warned that a remote call is fundamentally different from a local one in latency, memory access, partial failure, and concurrency — pretending otherwise breaks systems at scale.
- The Eight Fallacies (Joy/Lyon, Deutsch, Gosling) list the comfortable single-machine assumptions that become false — and costly — the moment a network is involved.
- Distributed-systems engineering is largely the discipline of refusing these assumptions and designing explicitly for failure, delay, duplication, and disorder.