How to Use This Guide
Consensus is the problem of getting a group of separate computers to agree on one value (one decision, one order of events) even though some of them may crash, restart, or fall out of touch at the worst possible moment. It sounds small. It is among the hardest problems in distributed systems, because the network can lose, delay, or reorder messages, and from the outside a machine that has crashed is indistinguishable from one that is merely slow. Solve consensus reliably and you can build a system that stays correct, and stays available as long as a majority of its nodes can still reach one another, while individual parts fail underneath you. That is why consensus is the crown jewel of the field: databases, lock services, configuration stores, and schedulers all rest on it.
The six sections are deliberately ordered so each one earns the next:
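- 1. The Consensus Problem — what agreement actually requires (agreement, validity, termination), why a majority is the magic number, and why the FLP impossibility result rules out any deterministic algorithm that always terminates in a fully asynchronous network where even one node may crash.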
- 2. Replicated State Machines & the Log — the universal trick: if every node applies the same commands in the same order, they stay identical. Consensus reduces to agreeing on a log.
- 3. Raft — Leader Election — Raft's first job: pick one leader per term so there is a single source of truth for the log.
- 4. Raft — Log Replication, Safety & Membership — how the leader copies entries, commits them by majority, guarantees safety, and changes the cluster's membership without violating that safety mid-transition.
- 5. Paxos — The Original Consensus Algorithm — the foundational algorithm Raft was reacting to: proposers, acceptors, and the two-phase prepare/accept dance.
- 6. Multi-Paxos, Raft vs Paxos & the Real World — running Paxos for a stream of decisions, an honest comparison, and the systems that ship consensus in production (etcd on Raft, Spanner on Paxos, ZooKeeper on Zab, its own Paxos-like protocol).
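As a small preview of the "majority" idea that Section 1 develops, the sketch below works out the arithmetic for a few cluster sizes. The function names `quorum_size` and `tolerated_failures` are illustrative choices made here, not names taken from Raft, Paxos, or any library. Treat it as a rough aid to intuition rather than part of either algorithm.

```python
def quorum_size(n: int) -> int:
    """Smallest group that is a strict majority of n nodes."""
    if n < 1:
        raise ValueError("cluster must have at least one node")
    return n // 2 + 1


def tolerated_failures(n: int) -> int:
    """Nodes that can be down while a majority is still reachable."""
    return n - quorum_size(n)


if __name__ == "__main__":
    for n in (3, 4, 5):
        q = quorum_size(n)
        # Two quorums together hold 2*q memberships among only n nodes,
        # so they must share at least 2*q - n nodes (always >= 1).
        overlap = 2 * q - n
        print(f"n={n} quorum={q} tolerates={tolerated_failures(n)} min_overlap={overlap}")
```

Under these definitions a 3-node cluster needs 2 votes and survives 1 failure, and a 5-node cluster needs 3 and survives 2. A 4-node cluster needs 3 yet still survives only 1, which is one reason odd cluster sizes are the usual choice. The guaranteed overlap of at least one node between any two majorities is what lets a later decision learn about an earlier one.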
Throughout, look for key boxes (the one idea you must keep), tips (practical guidance), and warnings (the traps that bite real engineers).