Distributed Systems — Many Computers Working as One

By Pritesh Yadav 16 min read

A distributed system is a group of separate computers that work together over a network (the wires and wireless links that carry data between machines) so that, to the person using it, they look and behave like a single system. Each computer is called a node. When you open Netflix, search Google, or check your bank balance, you are talking to thousands of nodes pretending to be one.

The computer scientist Leslie Lamport summed up the pain perfectly: "A distributed system is one in which the failure of a computer you didn't even know existed can render your own computer unusable." That sentence is the whole chapter in disguise. Let's unpack why we build these systems anyway, and why they are so hard.

Why build a distributed system at all?

Running on one machine is simple. We give up that simplicity for three concrete reasons.

  • Scale — one machine has a fixed amount of CPU (its "thinking power"), RAM (its short-term memory), and disk (its storage). When your traffic or data grows past the biggest single machine money can buy, you spread the work across many machines. This is called horizontal scaling: add more boxes instead of buying one giant box.
  • Reliability / fault tolerance — one machine is a single point of failure: when it dies, everything dies. With copies of your service on many machines, the system survives one machine dying. Oddly, more machines means more total failures happen — but the system keeps running through them.
  • Geography / latencylatency means delay. A user in Tokyo and a user in London both want fast replies. Data cannot travel faster than light, so a round trip around the world always takes hundreds of milliseconds. Putting servers near each region cuts that delay.
Key takeaway: We accept the hard problems of a distributed system to get three things one machine can't give: more capacity (scale), survival through hardware failure (reliability), and speed for users far away (geography).

The one new problem that changes everything: partial failure

On a single machine, things either work or the whole thing crashes. Failure is total and obvious. In a distributed system the defining new problem is partial failure: some nodes work, some don't — and you cannot tell which.

Here is the heart of the difficulty. You send a request to another node and hear nothing back. There are four possible reasons, and they all look identical from where you sit:

  1. Your request never arrived (it got lost on the network).
  2. It arrived and was processed, but the reply got lost coming back.
  3. The other node is just slow and is still working on it.
  4. The other node is dead.
Analogy: You text a friend and get no reply. Did the text not arrive? Did they read it and not answer yet? Is their phone dead? You honestly cannot tell from your side. That uncertainty is partial failure. Every hard idea in this chapter is a strategy for staying correct despite not knowing the answer.

Network partitions and timeouts

A network partition is when the network splits so that two groups of healthy nodes can no longer talk to each other. Each group is fine; they just can't reach the other side. Each side may wrongly believe the other side has died.

   Group A (healthy)         Group B (healthy)
   +-----------+    broken    +-----------+
   |  A1   A2  |----- X ------|  B1   B2  |
   +-----------+    link      +-----------+
        |                          |
   "B must be dead"          "A must be dead"
   (both are wrong - the LINK died, not the nodes)

Your only practical tool for detecting failure is a timeout: "if I don't hear back in X seconds, assume it's dead." But a timeout is only a guess:

  • Too short → you wrongly declare healthy-but-slow nodes dead (a false positive) and retry needlessly, sometimes flooding the system.
  • Too long → you hang for ages waiting on nodes that are truly dead.

There is no perfect timeout. In fact, deciding "is that node down?" in an asynchronous network (one with no guaranteed delivery time) is provably impossible to do perfectly. A famous 1985 result, the FLP impossibility (named after Fischer, Lynch, and Paterson), proves that no algorithm can guarantee all nodes reach agreement in an asynchronous network if even one node may fail. Real systems get around this with timeouts and a bit of luck — they work in practice, just not with a mathematical guarantee.

Common mistake: Setting timeouts naively. Too short causes "retry storms" that pile load onto an already-struggling system; too long makes everything freeze on dead nodes. Pair timeouts with retries that back off (wait longer each try), add jitter (small random delays so clients don't all retry at the same instant), and circuit breakers (stop calling a failing service for a while).

The 8 Fallacies of Distributed Computing

These are false beliefs developers fall into when they treat the network like it's local and perfect. Peter Deutsch (at Sun Microsystems, around 1994) wrote the first seven; James Gosling added the eighth.

#FallacyReality
1The network is reliablePackets drop, cables get cut.
2Latency is zeroRemote calls are far slower than local ones.
3Bandwidth is infiniteBig payloads choke the link.
4The network is secureAssume it's hostile; encrypt and authenticate.
5Topology doesn't changeNodes, routes, and IPs change constantly.
6There is one administratorMany teams and configs; coordination is hard.
7Transport cost is zeroMoving data costs money and CPU.
8The network is homogeneousMixed hardware, protocols, and versions are normal.
Common mistake: Treating a remote call like a normal local function call — ignoring latency, failure, and the cost of turning data into bytes to send (called serialization). "Chatty" designs that make hundreds of tiny network calls are one of the top performance killers. Also fallacy 4: don't skip encryption (TLS) and login checks between internal services just because they're "behind the firewall." Assume the network is hostile.

CAP theorem in plain words

Eric Brewer proposed the CAP theorem in 2000; Gilbert and Lynch proved it in 2002. It names three properties:

  • C — Consistency (specifically linearizability): every read sees the most recent write. All nodes agree on one current value, as if there were a single copy.
  • A — Availability: every request to a node that hasn't failed gets a real (non-error) answer, even if that answer might be slightly out of date.
  • P — Partition tolerance: the system keeps working even when messages between nodes are dropped or delayed.

People say "pick 2 of 3," but here's the honest framing: in any real network, partitions will happen. So P is not optional. You don't get to "pick CA." The real choice is what to do during a partition:

  • CP — refuse to answer rather than risk giving wrong data.
  • AP — keep answering with possibly-stale data rather than go down.
Example: A bank ATM picks CP: during a partition it's better to say "try again later" than to show a wrong balance and let you withdraw money twice. A social media like-count picks AP: a slightly stale count is fine, but the page going down is not.
Common mistake: Believing you can "pick CA." Saying "we chose CA" really means "we didn't think about what happens during a partition." Partitions are unavoidable, so the only real decision is C-vs-A during one.

PACELC — the fuller picture

CAP only describes a partition and ignores the everyday cost of consistency. Daniel Abadi's PACELC (2010) fills the gap. Read it as: if there's a Partition, choose Availability or Consistency; Else (normal operation), choose Latency or Consistency.

The insight: even on a perfectly healthy network, keeping copies strongly consistent requires extra coordination round-trips, and those add delay. Strong consistency always costs latency.

ClassMeaningExamples
PA/ELStay available in a partition; stay fast normally; weaker consistencyCassandra, Riak
PC/ECStrong consistency always, paying latencyGoogle Spanner, CockroachDB, classic ACID SQL
PA/ECAvailable under partition; consistent under normal opsMongoDB (default)

Consistency models — a spectrum, not a yes/no

Consistency isn't all-or-nothing. It runs from strongest (easiest to reason about, most expensive) to weakest (cheapest and fastest, hardest to reason about).

  • Strong / linearizable — behaves as if there's one single copy; a read always returns the latest committed write. Simplest to think about, highest coordination cost.
  • Causal — operations with a cause-and-effect link (like a reply to a comment) appear in that order everywhere. Unrelated operations may be seen in different orders on different nodes. A nice middle ground: it keeps things that "make sense together" in order without forcing global agreement.
  • Eventual — if writes stop, all copies eventually settle on the same value. Until then, reads may be stale or out of order. Cheapest and most available.

Session guarantees make eventual consistency feel sane for one user:

  • Read-your-writes — after you change something, you always see your own change (you post a comment and immediately see it).
  • Monotonic reads — once you've seen a value, you never see an older one (no "going back in time").
  • Monotonic writes — your writes are applied in the order you made them.
Common mistake: Thinking "eventual consistency" means "consistent within a few milliseconds, basically strong." It only guarantees that copies converge eventually if writes stop. Reads can be arbitrarily stale in the meantime. Where users will notice (like seeing their own post), add session guarantees such as read-your-writes.

Replication, quorums, and consensus

Replication means keeping copies of data on several nodes — for durability (you don't lose it if one disk dies), availability, and faster reads. The hard part is keeping the copies agreeing while nodes and networks fail.

A quorum is a voting trick. With N copies, require W nodes to confirm a write and R nodes to serve a read. If R + W > N, the read set and write set must overlap by at least one node — so a read is guaranteed to see the latest write.

  N = 3 replicas    W = 2 (write reaches 2)   R = 2 (read asks 2)
  Since 2 + 2 = 4  >  3, the sets MUST share a node.

   Write went to:  (n1, n2)
   Read asks:           (n2, n3)
                         ^^ overlap => read sees the fresh value

A majority quorum means more than half: ⌊N/2⌋+1. It tolerates a minority failing and prevents two separate groups from both committing — which would cause split-brain (two halves each thinking they're in charge, accepting conflicting writes).

Consensus is the act of making many nodes agree on a single ordered sequence of values — a replicated log (an append-only list of operations, in the same order on every node) — despite crashes.

  • Paxos (Leslie Lamport) — correct and foundational, but famously hard to understand and build.
  • Raft (Ongaro and Ousterhout, 2014) — designed to be understandable. At any time there's one leader; nodes are leader, follower, or candidate. Leader election uses randomized timeouts: whoever times out first becomes a candidate, asks for votes, and wins with a majority. Log replication: clients send to the leader, which appends the entry and forwards it to followers; once a majority confirms, the entry is committed. If the leader dies, a new election runs under a new term (a numbered period of leadership).
 Follower --(no heartbeat, times out)--> Candidate
 Candidate --(asks for votes, gets a majority)--> Leader
 Leader --(sends heartbeats)--> Followers
 Leader crashes --> a Follower times out --> new election (term+1)

Raft powers etcd, Consul, CockroachDB, and TiKV; Paxos-family algorithms underpin Google's Chubby/Spanner and ZooKeeper. Consensus needs a majority alive, so it tolerates ⌊(N−1)/2⌋ failures: 3 nodes tolerate 1, 5 nodes tolerate 2.

Best practice: Use an odd number of nodes (3, 5, 7). A 4-node cluster gives no more fault tolerance than 3 and makes tie votes (split-brain) possible. Also: don't confuse replication (just copying data) with consensus (agreeing on one order). You can replicate without consensus — but then you have no agreed order, which is exactly how split-brain and conflicting writes happen.

Retries, idempotency, and delivery guarantees

Because the network is unreliable, clients retry. But a retry can duplicate work: maybe your first request did succeed and only the reply got lost. The fix is idempotency: doing the operation twice has the same effect as doing it once.

Example: SET balance = 100 run twice still leaves 100 — safe to retry. But ADD 100 to balance run twice adds 200 — a lost-reply retry just double-charged the customer. The cure is an idempotency key: the client attaches a unique ID (say order-abc123); the server records which IDs it has processed and ignores duplicates.

Delivery guarantees describe how hard a system tries to deliver a message:

  • At-most-once — never retry; may lose messages. Simple but lossy.
  • At-least-once — retry until confirmed; may deliver duplicates. The common, practical default.
  • Exactly-once delivery over an unreliable channel is impossible.
Analogy: The Two Generals Problem. Two armies on opposite hills must attack at the same time, but every messenger crossing the valley might be captured. Even a confirmation can be lost — and so can the confirmation of the confirmation. Neither general can ever be 100% sure the other got the message. That's why no protocol can guarantee exactly-once delivery.

What real systems actually achieve is exactly-once processing = at-least-once delivery + idempotent processing on the receiving end. When a product says "exactly-once," this combination is what it means — never true network-level exactly-once.

Common mistake: Retrying a non-idempotent operation like "charge card" or "send email." Since a lost reply looks exactly like a lost request, blind retries cause duplicates. Make handlers idempotent (keys, dedup, upserts) before you add retries. And don't trust "exactly-once delivery" marketing — you still must write idempotent consumers.

Clocks and ordering — why wall clocks lie

It's tempting to order events across machines by their timestamps. Don't. Each machine's wall clock (its real-time-of-day clock) drifts. NTP (the protocol that syncs clocks over the internet) can even jump a clock backward to correct it. There is no single global "now." Comparing two timestamps from two servers can give the wrong order and silently lose data.

Instead we use logical clocks that order by cause-and-effect, not by time:

  • Lamport timestamps — each node keeps a counter, bumps it on every event, and on receiving a message sets counter = max(local, received) + 1. Guarantee: if A caused B, then A's number is smaller. But the reverse isn't true — a smaller number does not prove a causal link. Lamport clocks give a total order but lose information about which events were truly simultaneous.
  • Vector clocks — each node tracks a whole list of counters, one per node. By comparing two vectors you can tell whether two events are causally ordered or genuinely concurrent (neither caused the other). This is how systems detect conflicting concurrent writes that must be reconciled. The cost: the vector grows with the number of nodes.
Example: Three people edit a shared document. A Lamport number can say "edit 5 came after edit 3," but can't tell you whether two edits happened at the same time. Vector clocks like [A:2, B:1, C:0] versus [A:1, B:2, C:0] reveal that neither came before the other — they're concurrent and conflict, so a human or a merge rule must resolve them.

The rare exception is Google Spanner's TrueTime, which tames physical clocks using GPS and atomic clocks, then deliberately waits out a small uncertainty window to give globally consistent timestamps.

Common mistake: "Last write wins" by system clock. Clock drift and NTP backward jumps silently corrupt ordering and can throw away newer data. Use logical or vector clocks for causality, not raw wall-clock time.

Distributed transactions: 2PC versus Sagas

A transaction is a group of operations that must all succeed or all fail together (its ACID guarantee). Inside one database this is easy. Spanning several services or databases — a distributed transaction — is hard.

Two-Phase Commit (2PC) uses a coordinator:

  1. Phase 1 (prepare/vote) — the coordinator asks every participant "can you commit?" Each locks its resources and replies yes or no.
  2. Phase 2 (commit/abort) — if all said yes, tell everyone to commit; otherwise tell everyone to abort.

2PC gives atomicity but is blocking: participants hold locks while waiting, and if the coordinator crashes after the vote, participants can be stuck holding those locks indefinitely. The coordinator is a single point of failure. This is a poor fit for microservices because it couples services together and kills availability and throughput.

The Saga pattern takes a different path. Instead of one global transaction, run a sequence of local transactions, each committing on its own in its own service. If a later step fails, run compensating transactions that semantically undo the earlier ones — rather than rolling everything back at once.

Example (e-commerce order): (1) reserve inventory → (2) charge the card → (3) book shipping. If shipping fails, run compensations in reverse: refund the card, release the inventory. Compare with 2PC, where all three would hold locks until the coordinator finally said "commit."

Sagas come in two flavors: choreography (each service reacts to the others' events, no central brain) and orchestration (one central coordinator drives the steps). The trade-off versus 2PC: no global locks, far more scalable and available — but you give up isolation. Other operations can briefly see half-finished state, and every step needs a real undo action.

Common mistake: Reaching for 2PC across microservices "for safety," then watching it block and strand participants when the coordinator crashes. Prefer Sagas for cross-service workflows. But don't design only the happy path: a Saga has no isolation, so you must write a real compensation for every step and handle the window where intermediate state is visible.
Key takeaways:
  • A distributed system makes many computers act as one — for scale, fault tolerance, and low latency — but the network is unreliable and you can't tell "slow" from "dead." That ambiguity (partial failure) is the root of every hard problem.
  • Partitions are unavoidable, so CAP's real choice is C-vs-A during a partition (CP refuses, AP stays up); PACELC adds that even normally, strong consistency costs latency.
  • Consistency is a spectrum (strong → causal → eventual); session guarantees like read-your-writes make eventual consistency feel right to a single user.
  • Quorums (R + W > N) and majority consensus (Raft, Paxos) keep replicas agreeing and prevent split-brain; use odd node counts so N=3 tolerates 1 failure, N=5 tolerates 2.
  • Networks force retries, so make operations idempotent (idempotency keys); true exactly-once delivery is impossible — settle for at-least-once + idempotent processing.
  • Don't order events by wall clocks; use logical/vector clocks for causality. For cross-service transactions, prefer Sagas with compensations over blocking 2PC.

Continue reading