Distributed Systems
The hard middle of modern engineering — replication, partitioning, and failure.
20 posts · Engineering
- 1
How to Use This Guide
Consensus is the problem of getting a group of separate computers to agree on one value — one decision, one order of events — even though some of them may…
- 2
The Consensus Problem
This whole guide is about one deceptively small word: agree . You already know that in a distributed system there are many machines (we call them nodes ), the…
- 3
Replicated State Machines & the Log
This section gives you the one mental model that every practical consensus system — Raft, Paxos, ZooKeeper's ZAB, etcd, Consul, Google's Chubby and Spanner —…
- 4
Raft — Leader Election
Raft is an algorithm for keeping a replicated log in agreement across a cluster of servers. A "replicated log" is just an ordered list of commands (like "set x…
- 5
Raft — Log Replication, Safety & Membership
In the previous section you saw how Raft picks a single leader (the one server allowed to accept new commands) and how a term (a numbered period of time, each…
- 6
Paxos — The Original Consensus Algorithm
This is the algorithm that started it all. Paxos is a protocol that lets a group of unreliable machines agree on a single value, even when some of them crash,…
- 7
Multi-Paxos, Raft vs Paxos & the Real World
So far you have seen consensus as a way to agree on one value. In the previous sections, basic Paxos (single-decree Paxos) agrees on a single decision: one…
- 8
Glossary of Terms
Acceptor A Paxos role. Acceptors are the voting members; a value is chosen only when a majority of acceptors accept it.
- 9
Frequently Asked Questions
Q: Raft or Paxos — which should I actually use? A: For almost any new system, reach for a battle-tested Raft library (etcd's, or Hashicorp's) — Raft was…
- 10
Revision Cheat Sheet
One-line everything: Consensus = agree on one ordered log of commands, using a majority quorum, so identical replicated state machines stay in lockstep despite…
- 11
How to Use This Guide
Welcome. A distributed system is just a group of computers that work together to look like one single service to the people using it.
- 12
What Is a Distributed System?
Welcome to the very first step. You are a web developer, so you already know how to build software that runs on a computer.
- 13
Why Distributed Systems Are Hard
In the first section you learned what a distributed system is : a group of separate computers that work together over a network but appear, to the user, as one…
- 14
Time, Clocks & the Ordering of Events
One deceptively simple question: in what order did things happen? On a single computer this is trivial — you look at the clock, or you read the lines of a log…
- 15
Vector Clocks & Causality
In the previous section you met Lamport clocks — simple counters that give every event a number so we can put events in some order.
- 16
The CAP Theorem (and PACELC)
Once you split data across more than one machine, you run into a hard, unavoidable rule about what those machines can promise you.
- 17
Consistency Models
Imagine you save data in one place and it instantly appears everywhere. That is the dream. Reality is messier.
- 18
Glossary of Terms
Asynchronous network A network where messages can take any amount of time to arrive, with no guaranteed upper limit. The internet behaves this way.
- 19
Frequently Asked Questions
Q: What actually makes a system "distributed" — isn't every website on a server somewhere? A: A single server running everything is not distributed.
- 20
Revision Cheat Sheet
The 8 Fallacies of Distributed Computing (all are FALSE) The network is reliable Latency is zero Bandwidth is infinite The network is secure Topology doesn't…