Why Distributed Systems Are Hard

By Pritesh Yadav

In the first section you learned what a distributed system is: a group of separate computers that work together over a network but appear, to the user, as one system. This section answers a harder question: why is building one so difficult? If you have only ever written ordinary programs that run on a single machine, the rules you have relied on your whole career quietly stop being true. The goal of this section is to show you exactly which rules break, why they break, and what that forces you to do differently.

Let us define a few words up front, in plain English, because we will use them constantly.

  • Node — one computer (or one running process) that is part of the system. A node could be a server, a database replica, or a microservice.
  • Network — the wires, routers, and software that carry messages between nodes. Think of it as the postal service between computers.
  • Message — a chunk of data one node sends to another (a request, a reply, a heartbeat).
  • Process — a running program. On one machine you may have many.
  • State — the data a node currently holds in memory or on disk (account balances, user sessions, counters).

On a single machine, all your code shares the same memory, the same clock, and the same fate: if the program crashes, everything stops together. A distributed system throws away all three of those comforts. That is the root of every difficulty below.

2.1 Partial failure: the defining problem

What it is. Partial failure means that some parts of the system fail while other parts keep running. One server out of ten dies, or one network link goes down, but the rest of the system carries on, unaware.

Why it exists. On a single computer, failure is total. If the power dies, your whole program dies; there is no awkward in-between state where half your variables are alive and half are dead. But in a distributed system the parts are physically separate. They have separate power supplies, separate memory, separate operating systems. So nothing forces them to fail together. The normal case is that one piece breaks and the others do not even notice yet.

How it works, step by step. Imagine node A asks node B to "save this order."

   Node A                          Node B
     |                               |
     |---- "save order #42" -------->|   (1) request travels over network
     |                               |
     |                          [B saves it]   (2) B does the work
     |                               |
     |   X  <----- "done!" ----------|   (3) reply is LOST on the way back
     |                               |
   (A waits... and waits...)         |   (4) A never hears back

Look at A's situation in step 4. A sent a request and heard nothing. A cannot tell which of these happened:

  1. The request never reached B (so the order was not saved).
  2. B saved the order but the reply was lost (so the order was saved).
  3. B is just slow and the reply is still coming (so the order will be saved).
  4. B crashed completely (so… maybe saved, maybe not, depending on timing).

This is the cruel heart of distributed systems: from the outside, a node that has crashed looks identical to a node that is merely slow. Silence tells you nothing. There is no signal that says "I am definitely dead." A dead node and a node stuck behind a slow network both produce the exact same thing: no answer.
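The four possibilities can be sketched as a tiny model. This is an illustration only: the `Scenario` fields and the `run` function are hypothetical names invented here, and the "B slow" case is simplified to "saved, but the reply has not arrived by the time A gives up." What it shows is that A's observation is the same in every unhappy case, while the truth on B differs.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class Scenario:
    name: str
    request_arrives: bool        # did "save order #42" reach B?
    b_saves_order: bool          # did B finish the work?
    reply_arrives_in_time: bool  # did "done!" reach A before A stopped waiting?


def run(s: Scenario) -> Tuple[Optional[str], bool]:
    """Return (what A observes, whether the order is really saved on B)."""
    saved = s.request_arrives and s.b_saves_order
    reply = "done!" if (saved and s.reply_arrives_in_time) else None
    return reply, saved


SCENARIOS = [
    Scenario("happy path", True, True, True),
    Scenario("request lost", False, False, False),
    Scenario("reply lost", True, True, False),
    Scenario("B slow", True, True, False),
    Scenario("B crashed before saving", True, False, False),
    Scenario("B crashed after saving", True, True, False),
]

for s in SCENARIOS:
    observed, saved = run(s)
    print(s.name, "| A sees:", observed, "| order saved:", saved)
```

Every line except the happy path prints `A sees: None`, yet `order saved` is sometimes True and sometimes False. A has no way to read that second column.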

Analogy: You text a friend "Are we still on for dinner?" and get no reply. Did the text fail to send? Did they read it and not answer? Are they driving and will reply in ten minutes? Did they drop their phone in a lake? You genuinely cannot tell the difference from your end — all four feel the same: silence. Now imagine your whole business depends on guessing correctly. That is distributed computing.

Why it matters. In single-machine code you almost never write "what if assigning this variable only half-worked?" — it cannot half-work. In distributed code, every single remote interaction can half-work, and you must have a plan for it. Ignoring partial failure does not make it go away; it just means your system will corrupt data or hang the first time a packet is lost in production.

Key takeaway: Partial failure is the one problem that has no equivalent in single-machine programming, and it is the source of almost everything else that is hard. You can never assume "it worked" just because you did not see an error — silence is ambiguous.

2.2 The network is unreliable

What it is. The network — the thing that carries your messages — is not a reliable pipe. Messages can be lost (never arrive), delayed (arrive much later than expected), duplicated (arrive more than once), or reordered (arrive in a different order than you sent them).

Why it exists. A network is not a single wire. It is a long chain of cables, switches, and routers, often spanning buildings, cities, or continents. Each hop has buffers that can overflow and drop packets. Routers reboot. Cables get cut. Software retransmits, sometimes producing a duplicate. There is no physical law guaranteeing that what you send arrives, arrives once, or arrives in order.

Here are the four failure modes drawn out:

SENT (in order):     [M1] [M2] [M3] [M4]

LOST:                [M1] [--] [M3] [M4]   M2 vanished
DELAYED:             [M1]      [M3] [M4] ......... [M2]   M2 arrives very late
DUPLICATED:          [M1] [M2] [M2] [M3]   M2 arrived twice
REORDERED:           [M1] [M3] [M2] [M4]   M3 overtook M2

How it works, step by step. Say you send "withdraw $100" to a bank node. The network delays the message. You see no response, assume it was lost, and send it again. The retry arrives, and so does the original, which was only delayed, not lost. The account is now debited $200. The unreliable network turned one intended action into two.
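The withdrawal story can be replayed in a few lines. This is a minimal sketch: the `Account` class, the starting balance of 500, and the message string are all made up for illustration, and the "network" is just a list of messages that eventually get delivered.

```python
class Account:
    def __init__(self, balance: int) -> None:
        self.balance = balance

    def handle(self, message: str) -> None:
        # Non-idempotent handler: every delivered copy debits again.
        if message == "withdraw 100":
            self.balance -= 100


account = Account(balance=500)

# The client sends once; the network holds that copy up.
in_flight = ["withdraw 100"]
# The client sees no response, assumes loss, and sends again.
in_flight.append("withdraw 100")

# Both copies are eventually delivered.
for message in in_flight:
    account.handle(message)

print(account.balance)  # 300: one intended withdrawal, two debits
```

Nothing in the handler is wrong on its own terms. The bug is the assumption that each request is delivered exactly once.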

Example: This is not theoretical. TCP (the protocol your web requests ride on) exists precisely because the raw network is unreliable. TCP adds sequence numbers (to detect reordering and duplicates), acknowledgements (to detect loss), and retransmission (to recover from loss). But TCP only fixes ordering and loss within one connection. The moment a connection drops (and connections drop all the time), TCP cannot tell you whether your last message was processed. The application is back to guessing.

Analogy: Sending messages over a network is like mailing postcards. Most arrive. Some get lost. Occasionally the post office accidentally prints two copies. Sometimes a postcard you mailed Monday shows up after the one you mailed Tuesday. You would never run a bank by mailing postcards and assuming each one arrives exactly once, in order. Yet that is the medium every distributed system is built on.

Key takeaway: Treat the network as an enemy that loses, delays, duplicates, and reorders your messages. Any code that assumes "I sent it, so it arrived exactly once" is already broken; it just has not failed yet.
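To make the sequence-number idea concrete, here is a small receiver that drops duplicates and puts reordered messages back in order. It is a sketch of the general technique, not of TCP's real implementation: `InOrderReceiver` is a hypothetical name, and loss is not handled (a missing number would stall delivery until something retransmits it).

```python
from typing import Dict, List


class InOrderReceiver:
    """Delivers each sequence number once, in order (illustrative sketch)."""

    def __init__(self) -> None:
        self.next_seq = 1                  # the next number we are allowed to deliver
        self.buffer: Dict[int, str] = {}   # messages that arrived early
        self.delivered: List[str] = []

    def receive(self, seq: int, payload: str) -> None:
        if seq < self.next_seq or seq in self.buffer:
            return  # duplicate: already delivered or already buffered
        self.buffer[seq] = payload
        # Deliver as long as the next expected number is available.
        while self.next_seq in self.buffer:
            self.delivered.append(self.buffer.pop(self.next_seq))
            self.next_seq += 1


receiver = InOrderReceiver()
# M3 overtakes M2, and M2 arrives twice.
for seq, payload in [(1, "M1"), (3, "M3"), (2, "M2"), (2, "M2"), (4, "M4")]:
    receiver.receive(seq, payload)

print(receiver.delivered)  # ['M1', 'M2', 'M3', 'M4']
```

Note that this only works while both sides share one numbering. Once the connection (and its numbering) is gone, the question "was my last message processed?" comes back.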

2.3 No global clock: there is no shared "now"

What it is. Every machine has its own clock, and no two clocks agree exactly. There is no single, shared, authoritative "now" that all nodes can read. Worse, clocks drift — they speed up or slow down slightly over time — so the disagreement changes constantly.

Why it exists. A clock is a physical oscillator (a tiny vibrating crystal). No two crystals vibrate at exactly the same rate; temperature and age change the rate. So each machine's clock slowly wanders away from the others. Systems try to fix this with NTP (Network Time Protocol — software that periodically asks a time server "what time is it?" and nudges the local clock toward it). But NTP itself runs over the unreliable, variable-delay network, so it can only get clocks close, not identical. Clocks can be off by milliseconds normally, and by seconds or more when something goes wrong.

Why it matters. A huge amount of ordinary programming secretly relies on a shared clock. "Which write happened last?" "Did this token expire?" "Order these events by timestamp." On one machine those questions are trivial: there is one clock. Across machines, comparing two timestamps from two different clocks is unreliable, because the clocks may disagree by more than the time gap you are trying to measure.

Real time:   ----t1----t2---------------------->      (t2 is 10ms after t1)

Node A clock:  reads 10.000s at t1   (A's clock is 30ms FAST, so t1 is really 9.970s)
Node B clock:  reads  9.960s at t2   (B's clock is 20ms SLOW, so t2 is really 9.980s)

Event X happens on A, timestamped 10.000
Event Y happens on B LATER, timestamped  9.960

Naive sort by timestamp says: Y before X.   WRONG: Y came AFTER X.

Example: Apache Cassandra (a popular distributed database) historically used a "last write wins" rule based on wall-clock timestamps. If two servers had slightly skewed clocks, an older write with a higher timestamp could silently overwrite a newer write with a lower timestamp, quietly losing the customer's most recent update. This is a real, well-documented data-loss footgun, and its root cause is exactly "there is no global now." Systems that need a real ordering use logical clocks (Lamport clocks and vector clocks, covered later) or special hardware. Google Spanner goes so far as to install GPS receivers and atomic clocks in its data centres, and its "TrueTime" API deliberately returns a time interval ("now is somewhere between T1 and T2") rather than a single instant. That is an honest admission that exact time is unknowable; Spanner then waits out the uncertainty before committing, so that it stays correct.
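The "last write wins" hazard is easy to reproduce on paper. The sketch below is not Cassandra's code; `Write` and `last_write_wins` are hypothetical names, and the real times are illustrative values chosen so that the write on the slow-clock node truly happens 10 ms later. Times are whole milliseconds to keep the arithmetic exact.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Write:
    value: str
    real_ms: int   # true time of the write (no node can actually read this)
    skew_ms: int   # how far the writing node's clock is off

    @property
    def timestamp_ms(self) -> int:
        # The only time the database ever sees: the node's local clock reading.
        return self.real_ms + self.skew_ms


def last_write_wins(a: Write, b: Write) -> Write:
    """Keep whichever write carries the higher timestamp."""
    return a if a.timestamp_ms >= b.timestamp_ms else b


older = Write("old address", real_ms=9_970, skew_ms=+30)  # node with a fast clock
newer = Write("new address", real_ms=9_980, skew_ms=-20)  # node with a slow clock

winner = last_write_wins(older, newer)
print(older.timestamp_ms, newer.timestamp_ms)  # 10000 9960
print(winner.value)                            # old address
```

The newer write loses because its node's clock reads lower, and nothing in the stored data reveals that anything went wrong.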

Analogy: Imagine a meeting scheduled across a team where everyone's watch is set slightly differently and each watch drifts. If you log events by "what my watch said," you cannot reliably reconstruct what happened first. You need a shared logbook with a strict sign-in order, not everybody's personal watch. In distributed systems that shared logbook is a logical clock, not a wall clock.

Common mistake: Using server timestamps to decide ordering or "who wins" between two nodes. Clock skew (the gap between two clocks) and clock drift make this unreliable. Use wall-clock timestamps for rough display ("posted about 5 minutes ago"), never for correctness decisions.
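Logical clocks are covered properly later, but a preview helps show what "a shared logbook instead of a watch" means. Below is a minimal sketch of a Lamport clock: a plain counter that ignores wall-clock time entirely. The class and method names are chosen here for illustration.

```python
class LamportClock:
    def __init__(self) -> None:
        self.time = 0

    def tick(self) -> int:
        """Local event or message send: advance the counter."""
        self.time += 1
        return self.time

    def receive(self, sender_time: int) -> int:
        """Message receipt: jump past the sender's counter, then advance."""
        self.time = max(self.time, sender_time) + 1
        return self.time


a, b = LamportClock(), LamportClock()
x = a.tick()        # event X on node A; A sends a message stamped with x
y = b.receive(x)    # event Y on node B: receiving that message
assert x < y        # the cause is always numbered lower than its effect
print(x, y)         # 1 2
```

The counter says nothing about what time it is. It only guarantees that if one event caused another, the cause gets the smaller number, regardless of how skewed the machines' clocks are.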

2.4 Concurrency: many things happen at once

What it is. Concurrency means multiple things are happening at the same time, on different nodes, with no built-in agreement about their order. Two users, on two servers, can act on the same data in the same instant.

Why it exists. The whole point of a distributed system is that many nodes run in parallel — that is how you serve millions of users. But parallel means simultaneous, and simultaneous means there is no obvious "this happened before that." On one machine you can wrap shared data in a lock and force one-at-a-time access cheaply, because everything shares memory. Across machines there is no shared memory to lock; coordinating order now requires sending messages over the unreliable network, which is slow and can fail.

How it works, step by step. Two customers buy the last concert ticket at the same moment:

 Time -->
 User A  --> Server 1: "tickets left?"  reads 1
 User B  --> Server 2: "tickets left?"  reads 1     (both read at once!)
 User A  --> Server 1: "buy 1"          writes 0
 User B  --> Server 2: "buy 1"          writes 0
 RESULT: two tickets sold, one ticket existed.  Oversold.

Each server, looking only at itself, behaved correctly. The bug only exists in the interleaving of the two — and that interleaving is invisible to either node. This is why distributed systems need explicit coordination tools (locks held in a service like etcd or ZooKeeper, or database transactions) to serialize access to shared resources.
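The interleaving can be replayed deterministically. The sketch below uses a hypothetical in-memory `TicketStore` standing in for the shared inventory, and it swaps in compare-and-set (write only if the value is still what you read) as one simple way to serialize access. It is a stand-in for the locks or transactions mentioned above, not an implementation of them, and a real store would have to perform the check and the write atomically.

```python
class TicketStore:
    """Shared inventory as both servers see it (illustrative stand-in)."""

    def __init__(self, tickets: int) -> None:
        self.tickets = tickets

    def read(self) -> int:
        return self.tickets

    def write(self, value: int) -> None:
        self.tickets = value

    def compare_and_set(self, expected: int, new: int) -> bool:
        # Succeeds only if nobody changed the value since it was read.
        if self.tickets != expected:
            return False
        self.tickets = new
        return True


# Unsafe interleaving from the diagram: read, read, write, write.
store = TicketStore(1)
seen_by_1 = store.read()
seen_by_2 = store.read()
sold_unsafe = 0
for seen in (seen_by_1, seen_by_2):
    if seen > 0:
        store.write(seen - 1)
        sold_unsafe += 1

# Same interleaving with compare-and-set: the second buyer is rejected.
store = TicketStore(1)
seen_by_1 = store.read()
seen_by_2 = store.read()
sold_safe = sum(
    1
    for seen in (seen_by_1, seen_by_2)
    if seen > 0 and store.compare_and_set(seen, seen - 1)
)

print(sold_unsafe, sold_safe)  # 2 1
```

In the second run, Server 2's write fails because the count is no longer the 1 it read. It has to re-read, sees 0, and tells User B the tickets are gone.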

Key takeaway: In a distributed system, "at the same time" is the default, not the exception. Any shared resource needs a deliberate coordination strategy, because the order of concurrent actions is not decided for you.

2.5 The Two Generals Problem: you can never be 100% sure

What it is. The Two Generals Problem is a famous thought experiment that proves a startling fact: over an unreliable channel, two parties can never become completely certain they agree. It was the first computer-communication problem ever proven to be unsolvable.

The setup. Two generals, A and B, are camped on opposite hills with the enemy in the valley between them. They will win only if they attack at the same time. If just one attacks alone, that army is destroyed. They can communicate only by sending a messenger through the valley — and the messenger may be captured, so any message may be lost.

   General A  ))))) valley (enemy may capture messenger) (((((  General B
       |                                                          |
       |---- "Attack at dawn!" --------------------------------->|   may be lost
       |                                                          |
       |<--- "Agreed — at dawn!" (ack) --------------------------|   may be lost
       |                                                          |
       |---- "Got your ack!" (ack of the ack) ----------------->|   may be lost
       |                          ... and so on, forever ...      |

Why it is unsolvable. Suppose A sends "Attack at dawn." A cannot act on it until A knows B received it, so B must send an acknowledgement (an ack, a "got it" reply). But that ack might be lost, and B cannot safely attack until B knows A received it, so A must ack the ack. But that might be lost too. Every message needs a confirmation, and every confirmation needs a confirmation. There is no last message you can fully trust, because the last message might be the one that gets lost. No finite number of messages ever produces certainty. The proof is by contradiction: suppose a working protocol exists, and take the one that uses the fewest messages. Its final message might be lost, and its sender cannot know whether it arrived, so the protocol must still work when that message is lost. But then the protocol without its final message also works, and it is shorter than the shortest one. Contradiction. So no such protocol exists.
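The key step in that argument (the sender of the last message cannot tell whether it arrived) can be checked mechanically. The sketch below is a toy model, not a formal proof: `run` is a hypothetical function that plays out an alternating exchange of `n` messages and records what each general has seen, optionally capturing one messenger.

```python
from typing import Dict, List, Optional, Tuple


def run(n: int, lost: Optional[int]) -> Tuple[List[str], List[str]]:
    """Play n alternating messages (A first); return (A's view, B's view)."""
    views: Dict[str, List[str]] = {"A": [], "B": []}
    for i in range(1, n + 1):
        sender, receiver = ("A", "B") if i % 2 == 1 else ("B", "A")
        views[sender].append(f"sent #{i}")
        if i == lost:
            break  # messenger captured; nothing further is triggered
        views[receiver].append(f"received #{i}")
    return views["A"], views["B"]


for n in range(1, 7):
    last_sender = 0 if n % 2 == 1 else 1   # index into (A's view, B's view)
    delivered = run(n, lost=None)          # every message arrives
    captured = run(n, lost=n)              # only the final message is lost
    # The last sender sees exactly the same thing in both runs...
    assert delivered[last_sender] == captured[last_sender]
    # ...while the other general's view differs.
    assert delivered[1 - last_sender] != captured[1 - last_sender]

print("the last sender can never tell the two runs apart")
```

However long you make the exchange, whoever speaks last has identical information in both worlds, so they must make the same decision in both. Adding one more message only moves the problem to the other general.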

Why it matters in practice. This is not just a puzzle. Every time your service calls another service, you are a general sending a messenger. When you get no reply, you are in exactly the general's dilemma: did they get it or not? You cannot achieve certainty. What you do instead is settle for being probably sure and designing so that the remaining doubt is harmless. The three tools below are how we live with that doubt.

Analogy: "Read receipts." You text a plan and want to be sure your friend will show up. You see "Delivered" — but did they read it? They send "👍" — but do they know you saw their 👍? You could volley confirmations forever and still never reach perfect certainty that you are both on the same page. At some point one of you just acts, accepting a small risk. Distributed systems do the same: act under uncertainty, and engineer so a wrong guess is recoverable.

2.6 The consequences: timeouts, retries, idempotency

The problems above force three coping tools into almost every distributed system. Learn the words now; you will see them everywhere.

Timeout

What it is. A timeout is a maximum time you are willing to wait for a reply before giving up and assuming something went wrong. "If B has not answered in 5 seconds, stop waiting."

Why we need it. Because of partial failure (§2.1), a node may never answer. Without a timeout, your code would wait forever, holding memory and connections, until it too falls over. The hard part: a timeout cannot tell you whether the work happened. It only tells you that you stopped waiting. A timed-out request may still have succeeded on the other side.
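Here is a small sketch of that hard part, using only the Python standard library. The remote node is faked with a thread: `slow_remote_save` is a hypothetical stand-in for B that is slow but does finish. The durations (0.5 s of work, 50 ms of patience) are arbitrary values picked for the demonstration.

```python
import concurrent.futures
import threading
import time

work_done = threading.Event()


def slow_remote_save() -> str:
    """Stand-in for node B: slow, but it does finish the work."""
    time.sleep(0.5)
    work_done.set()
    return "done!"


executor = concurrent.futures.ThreadPoolExecutor(max_workers=1)
future = executor.submit(slow_remote_save)
try:
    outcome = future.result(timeout=0.05)  # wait at most 50 ms
except concurrent.futures.TimeoutError:
    outcome = "unknown"                    # we stopped waiting; that is all we know

print(outcome)                # unknown
executor.shutdown(wait=True)  # let the stand-in finish so we can inspect it
print(work_done.is_set())     # True: the "timed-out" request succeeded anyway
```

The caller's only honest conclusion at the timeout is "unknown." Treating it as "failed" would be wrong in exactly this case.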

Retry

What it is. A retry is simply sending the request again after a timeout or error, in the hope it works this time.

Why we need it. The network loses messages (§2.2), and many failures are temporary (a brief blip, an overloaded server). Retrying recovers from those. The danger: remember the "withdraw $100 twice" story. If the first request actually succeeded but its reply was lost, your retry does the work a second time. Retries turn lost-reply situations into duplicate-action situations — which is why we need the next tool.
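A retry loop is only a few lines. The sketch below is one possible shape, with assumptions stated plainly: `retry` and `flaky` are hypothetical names, the pause between attempts is a fixed illustrative value, and the `sleep` parameter exists only so the example can record pauses instead of really waiting.

```python
import time
from typing import Callable, List, TypeVar

T = TypeVar("T")


def retry(
    call: Callable[[], T],
    attempts: int = 4,
    delay: float = 0.1,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Run `call` until it succeeds or `attempts` tries have been used."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(attempts):
        try:
            return call()
        except (TimeoutError, ConnectionError):
            if attempt == attempts - 1:
                raise        # out of attempts: surface the last error
            sleep(delay)     # brief pause before trying again


calls = {"count": 0}


def flaky() -> str:
    """Stand-in for a remote call that times out twice, then works."""
    calls["count"] += 1
    if calls["count"] < 3:
        raise TimeoutError("no reply")
    return "ok"


pauses: List[float] = []
result = retry(flaky, sleep=pauses.append)  # record pauses instead of sleeping
print(result, calls["count"])               # ok 3
```

Notice what the loop cannot know: whether the two "failed" attempts were processed on the other side. If `flaky` were a charge or a withdrawal, this code could have performed it up to three times.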

Idempotency

What it is. An operation is idempotent if doing it twice has the same effect as doing it once. "Set status to PAID" is idempotent — running it five times still just means PAID. "Add $100 to the balance" is not idempotent — running it five times adds $500.

Why we need it. Because retries (and duplicated network messages) mean the same request will arrive more than once. If your operations are idempotent, duplicates are harmless. The common technique is an idempotency key: the client attaches a unique ID to the request; the server remembers which IDs it has already processed and ignores repeats.
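The server side of an idempotency key can be sketched in a few lines. This is an illustration of the idea only: `PaymentServer` is a hypothetical class, the key store is an in-memory dictionary, and a real service would need to persist the key together with the effect (atomically) and eventually expire old keys.

```python
from typing import Dict


class PaymentServer:
    def __init__(self) -> None:
        self.charged_total = 0
        self._results: Dict[str, str] = {}  # idempotency key -> stored result

    def charge(self, amount: int, idempotency_key: str) -> str:
        if idempotency_key in self._results:
            # Repeat of a request we already handled: replay, do not re-run.
            return self._results[idempotency_key]
        self.charged_total += amount
        result = f"charged ${amount}"
        self._results[idempotency_key] = result
        return result


server = PaymentServer()
first = server.charge(50, idempotency_key="k9")   # reply is lost on the way back
second = server.charge(50, idempotency_key="k9")  # client times out and retries
print(second)                # charged $50
print(server.charged_total)  # 50
```

The retry gets the same answer the original would have received, and the customer is charged once. The client's job is simply to reuse the same key for every retry of the same logical request.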

Example: Stripe, the payment company, supports this across its API. A charge request can carry an Idempotency-Key header. If your code times out and retries a charge with the same key, Stripe sees the repeated key and returns the original result instead of charging the customer twice. This single idea is what makes it safe to retry over an unreliable network. The same pattern protects message systems: Apache Kafka added an "idempotent producer" so that a producer retrying after a network hiccup does not write the same record twice.

Without idempotency:                  With idempotency key "k9":
  send "charge $50"  --X (lost reply)   send "charge $50" key=k9 --X (lost reply)
  timeout -> retry "charge $50"         timeout -> retry  key=k9
  RESULT: charged $100  (BUG)           server: "seen k9 already" -> $50 once  (SAFE)

Key takeaway: Timeouts + retries + idempotency are the standard survival kit. Timeouts stop you waiting on the dead; retries recover from lost messages; idempotency makes the inevitable duplicates harmless. You almost never want one without the other two.

2.7 "A Note on Distributed Computing" (Waldo et al., 1994)

In November 1994, four engineers at Sun Microsystems — Jim Waldo, Geoff Wyant, Ann Wollrath, and Sam Kendall — published a now-famous paper called "A Note on Distributed Computing." It is one of the most important warnings in the field, and its message is simple:

A remote call is NOT just a local function call that happens to go over a wire. Pretending it is will eventually destroy your system.

The problem it attacks. In the early 1990s, a popular idea was transparency: make calling a function on another machine look exactly like calling a function in your own program, so developers do not have to think about the network. This was sold as a convenience. The paper argues it is a trap. (The technique of making a remote call look local is called RPC — Remote Procedure Call. It is genuinely useful, but the illusion of sameness is the danger.)

The four fundamental differences the paper says you can never hide:

| Dimension | Local call (same machine) | Remote call (over network) |
| --- | --- | --- |
| Latency (how long it takes) | Nanoseconds. Effectively free. | Hundreds of microseconds to hundreds of milliseconds: thousands to millions of times slower. A loop that was instant locally can take minutes remotely. |
| Memory access | You share one address space; a pointer handed over by the caller is still valid in the callee. | No shared memory. A pointer is meaningless across machines; data must be copied (serialized) and sent. References do not transfer. |
| Partial failure | Either the whole program runs or it crashes, all together. | The other side can be dead, slow, or unreachable while you keep running, and you cannot tell which (see §2.1). |
| Concurrency | You can often reason about one thread at a time; locks are cheap. | Other parties act simultaneously and independently; you must design for it (see §2.4). |

Why pretending is dangerous. If your code thinks a remote call is local, it will (1) make far too many tiny remote calls and run unbearably slowly because latency is real; (2) assume the call either fully succeeds or throws — with no handling for "I have no idea if it worked"; (3) ignore that two clients can call at once. The paper notes these failures were long hidden because early distributed systems were tiny. Scale them up and the cracks become catastrophes. The lesson: build the network's reality into your interfaces on purpose — design APIs that are coarse-grained (few big calls, not many small ones), that return explicit failure outcomes, and that the programmer knows are remote.
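One way to "return explicit failure outcomes" is to make the unknown case a first-class result instead of folding it into an exception or a boolean. The sketch below is an assumption about how such an interface might look, not something the paper prescribes: `Outcome` and `call_remote` are hypothetical names, and the mapping from exception type to outcome is a simplification.

```python
from enum import Enum
from typing import Callable


class Outcome(Enum):
    SUCCEEDED = "succeeded"
    FAILED = "failed"     # definitely not applied on the other side
    UNKNOWN = "unknown"   # may or may not have been applied


def call_remote(send: Callable[[], None]) -> Outcome:
    """Run one remote interaction and classify how it ended."""
    try:
        send()
    except ConnectionRefusedError:
        # Assumption: we never connected, so the request was never sent.
        return Outcome.FAILED
    except TimeoutError:
        # We stopped waiting; the other side may still have done the work.
        return Outcome.UNKNOWN
    return Outcome.SUCCEEDED


def ok() -> None:
    return None


def refused() -> None:
    raise ConnectionRefusedError("nothing listening")


def timed_out() -> None:
    raise TimeoutError("no reply")


for fake_call in (ok, refused, timed_out):
    print(fake_call.__name__, "->", call_remote(fake_call).value)
```

A caller that receives `UNKNOWN` is forced to decide what to do about it (retry with an idempotency key, check the state later, alert a human), which is exactly the decision a "transparent" local-looking call lets you forget.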

Analogy: Asking a coworker sitting next to you a question is a "local call" — instant, reliable, you share context. Asking the same question by mailing a letter overseas is a "remote call." If you write your workflow as if every letter were a tap on the shoulder — fire off a hundred of them and assume each gets an instant, guaranteed answer — your project grinds to a halt and breaks the first time a letter is lost. The medium changes how you must work, no matter how much you wish it did not.

2.8 The Eight Fallacies of Distributed Computing

The Eight Fallacies of Distributed Computing are a list of false assumptions that newcomers make. The first four are credited to Bill Joy and Dave Lyon at Sun; L. Peter Deutsch added three more around 1994; and James Gosling (creator of Java) added the eighth around 1997. They are called fallacies because each one feels obviously true and is, in fact, false — and each false belief carries a real cost.

| # | Fallacy | Why it is false | What it costs you if you believe it |
| --- | --- | --- | --- |
| 1 | The network is reliable | Cables get cut, switches reboot, packets drop. Failure is routine, not rare. | No error handling, no retries → requests hang forever, resources leak, the app stalls the first time a packet is lost. |
| 2 | Latency is zero | Even light in a vacuum needs about 67ms to travel halfway around the Earth; signals in fiber are slower, and real round-trips are slower still. Remote is not instant. | "Chatty" designs making hundreds of tiny remote calls become unusably slow. A page that loads fast in dev crawls in production. |
| 3 | Bandwidth is infinite | Links carry a finite amount per second; big payloads and many users saturate them. | Sending huge responses / unbounded data → bottlenecks, congestion, dropped packets, timeouts under load. |
| 4 | The network is secure | Anyone on the path can read, alter, or inject traffic; threats come from inside too. | No encryption/auth → eavesdropping, tampering, breaches. You get blindsided by attackers you assumed could not reach you. |
| 5 | Topology doesn't change | Servers are added/removed, IPs change, links reroute, cloud instances come and go constantly. | Hard-coded hosts/IPs and fixed routes break; you need service discovery and dynamic config, or things silently stop resolving. |
| 6 | There is one administrator | Different teams, vendors, and cloud providers each control a piece, with their own rules. | You cannot assume coordinated changes; conflicting policies, firewall rules, and upgrade schedules cause mystery outages. |
| 7 | Transport cost is zero | Moving bytes costs money (egress fees, bandwidth bills) and CPU (serializing/deserializing data). | Surprise cloud bills and CPU spent packing/unpacking data you ignored in your budget and performance plan. |
| 8 | The network is homogeneous | Real systems mix operating systems, languages, hardware, and protocols. | Assuming everyone speaks your exact format → incompatibility; you need standard, portable formats (e.g. JSON, Protobuf) and explicit contracts. |

Example: A team builds a feature that, for each item on a page, makes one separate API call to a service: 200 items, 200 calls. On the developer's laptop the service is local, so latency is near zero (fallacy #2 in action) and it feels instant. Deployed, that service is in another data centre 80ms away. 200 sequential calls × 80ms = 16 seconds to load one page. Nothing is "broken"; they simply believed latency was zero. The fix (batch the 200 into one call) is exactly the "coarse-grained interface" lesson from Waldo's paper.

Analogy: The fallacies are like a new driver's wishful assumptions: "the road is always clear," "I'll always have a green light," "petrol is basically free," "every road is the same width." Each feels harmless until reality (traffic, red lights, fuel bills, a narrow bridge) proves it wrong, usually at the worst moment. Experienced drivers plan for all of them. Experienced distributed-systems engineers plan for all eight fallacies.

Key takeaway: Every fallacy is a comfortable assumption from single-machine programming that quietly becomes false the moment a network is involved. Designing a distributed system is largely the discipline of refusing to believe these eight things.
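The arithmetic behind the 200-call example is worth writing down, because it is the quickest check you can run on any design. The sketch below counts only time spent waiting on round trips; it deliberately ignores server processing and transfer time, and the function name is made up for illustration.

```python
ROUND_TRIP_MS = 80  # distance to the service in the example
ITEMS = 200         # items on the page


def network_wait_ms(calls: int, round_trip_ms: int = ROUND_TRIP_MS) -> int:
    """Time spent waiting on the network for `calls` sequential round trips."""
    return calls * round_trip_ms


chatty = network_wait_ms(ITEMS)  # one call per item
batched = network_wait_ms(1)     # one call carrying all 200 items
print(chatty, batched)           # 16000 80
```

The batched call will take somewhat longer than 80 ms in practice because it carries more data, but it pays the round-trip cost once instead of 200 times.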

Common mistakes & misconceptions

Common mistake: Treating "no response" as "the request failed." A timed-out or lost request may have fully succeeded on the other side. Silence means "unknown," not "failed." Design for the unknown, not for a clean failure.
Common mistake: Adding retries without idempotency. Retrying a non-idempotent operation (charge, increment, "send email") over a network that duplicates and delays messages turns one lost reply into double charges, double emails, double inventory deductions. Always pair retries with idempotency keys.
Common mistake: Ordering events or resolving conflicts by comparing wall-clock timestamps from different machines. Clock skew and drift make cross-node timestamp comparisons unreliable; "last write wins" by timestamp can silently lose the newer write (the classic Cassandra footgun).
Common mistake: Believing RPC makes the network "transparent." Hiding the network does not remove latency, partial failure, or concurrency — it only removes your handling of them. As Waldo et al. warned, the illusion of a local call is exactly what makes systems brittle at scale.
Common mistake: Designing chatty interfaces — many small remote calls in a loop — because they were fast in local testing. Latency is not zero in production; coarse-grained, batched calls are the cure.
Common mistake: Assuming you can eventually reach 100% certainty that another node received your message if you just add "one more confirmation." The Two Generals Problem proves you cannot. Accept residual uncertainty and make a wrong guess recoverable instead of chasing impossible certainty.

Best practices

Best practice: Set a sensible timeout on every remote call. Never wait forever. A request with no timeout is a resource leak waiting to take down the caller along with the callee.
Best practice: Make write operations idempotent and use idempotency keys (Stripe's model). Then duplicates from retries or network re-sends become harmless, and you can retry freely.
Best practice: Use logical ordering (Lamport clocks, vector clocks, sequence numbers) or a dedicated coordination service (etcd, ZooKeeper) for correctness decisions. Reserve wall-clock time for human-facing display only.
Best practice: Design coarse-grained, network-aware interfaces. Make remote APIs do meaningful work per call, return explicit success/failure outcomes, and never pretend they are local function calls.
Best practice: Write down your assumptions and test them against the eight fallacies before shipping. Ask explicitly: what if the network drops this? What if it is slow? What if two of these run at once? What if the other node is dead?
Best practice: Inject failure on purpose. Use artificial delays, dropped packets, and chaos testing in staging so partial failure is something your code has already survived before a customer triggers it in production.

Section summary

  • Partial failure is the defining hardship: parts fail independently, and a crashed node is indistinguishable from a slow one — silence is ambiguous.
  • The network is unreliable: messages can be lost, delayed, duplicated, and reordered. "I sent it, so it arrived once, in order" is always wrong eventually.
  • There is no global clock: every machine's clock differs and drifts, so cross-node timestamps cannot be trusted for ordering or correctness.
  • Concurrency is the default: many nodes act at once with no inherent order, so shared resources need deliberate coordination.
  • The Two Generals Problem proves you can never reach 100% certainty of agreement over an unreliable channel — you must act under uncertainty and stay recoverable.
  • These realities force three survival tools: timeouts (stop waiting on the dead), retries (recover from loss), and idempotency (make the resulting duplicates harmless).
  • Waldo et al. (1994) warned that a remote call is fundamentally different from a local one in latency, memory access, partial failure, and concurrency — pretending otherwise breaks systems at scale.
  • The Eight Fallacies (Joy/Lyon, Deutsch, Gosling) list the comfortable single-machine assumptions that become false — and costly — the moment a network is involved.
  • Distributed-systems engineering is largely the discipline of refusing these assumptions and designing explicitly for failure, delay, duplication, and disorder.
