Paxos — The Original Consensus Algorithm
This is the algorithm that started it all. Paxos is a protocol that lets a group of unreliable machines agree on a single value, even when some of them crash, messages are lost or delayed, and several machines try to propose different values at the same time. "Consensus" here means exactly that: getting a set of nodes to agree on one value and never change their minds.
Paxos is famous for two opposite things at once: it is provably correct, and it is notoriously hard to understand. We are going to teach it slowly, one piece at a time. By the end you will see that the core idea is actually small — it is just packed very tightly.
5.1 History and why Paxos still matters
Leslie Lamport (the same person who gave us logical clocks and happens-before) first described Paxos in a 1990 paper called "The Part-Time Parliament." He wrote it as a joke story about a fictional Greek island called Paxos, whose parliament passed laws even though legislators kept wandering in and out of the chamber (the "parliamentarians" are the unreliable machines). The joke backfired: people found the story so confusing that almost nobody understood the algorithm hidden inside it. The paper sat largely ignored for years.
In 2001 Lamport gave up on the joke and published "Paxos Made Simple," a short, plain-English re-explanation. Its famous opening line is roughly: "The Paxos algorithm, when presented in plain English, is very simple." Many readers disagreed about the "simple" part — which is exactly why the Raft paper (the next section) was written, with the explicit goal of being understandable. But Paxos still matters enormously, because it is the theoretical bedrock. Google's Chubby lock service and Spanner database, Microsoft's Azure storage, Amazon's internal systems, and many others are built on Paxos or close relatives of it. Even systems that use Raft or ZAB instead are using ideas that Paxos established and proved correct first.
5.2 The setup: roles and the goal
Before any steps, we need the cast of characters. Paxos splits the work into three roles. A "role" is just a job a machine performs — and importantly, one physical machine can play several roles at once. In real deployments every server is usually a proposer, an acceptor, and a learner all together. We separate the roles only to explain the logic cleanly.
| Role | Plain-English job | What it does |
|---|---|---|
| Proposer | The one who suggests a value | Tries to get a value chosen. Drives the whole protocol by sending requests. |
| Acceptor | The fault-tolerant memory / voter | Votes on proposals and remembers what it has promised and accepted, even across crashes. The acceptors are the system's persistent memory. |
| Learner | The one who finds out the result | Learns which value was finally chosen, so it can act on it (e.g. apply it to a database). |
The goal of single-decree Paxos ("single-decree" = deciding one value, one single decision) is precise:
- Safety: Only a value that was actually proposed can be chosen. Only one value is ever chosen. Once a value is chosen, it is chosen forever — nobody can un-choose it or choose a different one. A learner never learns a value unless it really was chosen.
- Liveness (best-effort): If things stabilize, some proposed value eventually does get chosen and learners find out. (As we will see, Paxos cannot guarantee this in a fully asynchronous world — that is the FLP result from the foundations guide — so it settles for "eventually, once the network behaves.")
The mechanism that makes a value "chosen" is a majority of acceptors. A majority means more than half (for 5 acceptors, that is 3). The magic property of majorities is that any two majorities must overlap in at least one acceptor. Two groups of 3 chosen from 5 must share at least one member. That single overlapping acceptor is the secret that makes Paxos safe — keep it in mind, we will use it.
5.3 Proposal numbers (ballots): giving every attempt a unique, ordered ID
Several proposers may be active at once, and a single proposer may retry after a failure. To keep order out of this chaos, every proposal attempt is tagged with a proposal number (also called a ballot number). Two rules govern these numbers:
- They are totally ordered — given any two, you can always say which is bigger. Bigger means "newer / takes priority."
- They are globally unique — no two proposers ever pick the same number. This is essential: if two proposals shared a number, the tie-breaking that keeps Paxos safe would fall apart.
How do you make numbers unique across machines with no shared clock? The standard trick is to build each number out of two parts: a locally-increasing counter combined with the server's own unique ID — for example the pair (counter, serverId), compared counter-first and breaking ties by serverId. Server A's numbers might be 1.A, 2.A, 3.A while server B's are 1.B, 2.B, 3.B; they can never collide because the serverId differs. A simpler-to-picture version: give proposer A only odd numbers and proposer B only even numbers. Either way, each proposer can always pick a number higher than any it has seen, and no two numbers are ever equal.
5.4 Phase 1 — Prepare / Promise
Paxos runs in two phases. Phase 1 is about claiming priority and discovering history. The proposer is essentially asking, "Can I have a turn at number n, and by the way, has anything already been agreed that I need to honor?"
Phase 1a — Prepare (proposer → acceptors). The proposer picks a brand-new proposal number n (higher than any it has used) and sends a Prepare(n) message to at least a majority of acceptors. Note: at this point it sends no value yet — only the number.
Phase 1b — Promise (acceptor → proposer). When an acceptor receives Prepare(n):
- If
nis greater than every proposal number this acceptor has already responded to, it replies with a Promise. The Promise is a binding commitment: "I promise I will never again accept any proposal numbered less than n." It is effectively raising the floor. - Crucially, along with the promise the acceptor reports back any value it has already accepted — specifically the value, tagged with the proposal number under which it accepted it: it returns
(accepted_number, accepted_value), or "nothing" if it has accepted no value yet. This is how already-agreed history travels forward. - If
nis not higher than something it has already promised, the acceptor ignores the request (or sends a "no" / NACK so the proposer can give up faster). - The acceptor must write this promise to stable storage (disk) before replying, so that if it crashes and restarts it still remembers the floor it raised. The acceptors are the durable memory of the system.
PHASE 1: Prepare / Promise (proposer P, acceptors A1 A2 A3) P A1 A2 A3 |---Prepare(n)--->| | | |---Prepare(n)------------------>| | |---Prepare(n)------------------------------->| | | | | | (each acceptor: is n > my highest seen number?) | (A1,A2,A3 had accepted nothing yet -> promise, report "none") | | | | |<--Promise(n, none)| | | |<--Promise(n, none)-------------| | |<--Promise(n, none)---------------------------| | | P heard from a MAJORITY (3 of 3). It may proceed to Phase 2. | No prior value reported -> P is free to use its OWN value.
5.5 Phase 2 — Accept / Accepted
Phase 2 is where an actual value gets pinned down. The proposer now knows it has the floor (a majority promised) and knows the relevant history (whatever values came back in the promises).
Phase 2a — Accept (proposer → acceptors). If the proposer received Promises from a majority of acceptors, it now chooses which value to push, using one strict rule:
- Look at all the Promises that came back. If any of them reported an already-accepted value, the proposer must use the value attached to the highest accepted_number among them. It does not get to use its own value — it is forced to carry forward the most recent already-accepted value.
- Only if none of the majority reported any accepted value is the proposer free to propose its own value.
Call the chosen value v. The proposer sends Accept(n, v) to the acceptors.
Phase 2b — Accepted (acceptor → proposer/learners). When an acceptor receives Accept(n, v), it accepts it unless it has, in the meantime, already promised some higher number than n. In plain terms: "I'll accept this as long as nobody newer than you came along while you were gone." On accepting, it records (n, v) durably and notifies the learners. When a majority of acceptors have accepted (n, v), the value v is chosen — permanently. Learners that hear a majority of Accepted messages know the decision and can act on it.
PHASE 2: Accept / Accepted (continuing from Phase 1) P A1 A2 A3 | P picks v = its own value (no prior value was reported). |---Accept(n, v)->| | | |---Accept(n, v)---------------->| | |---Accept(n, v)------------------------------>| | | | | | (each: have I promised anything > n? No -> accept (n,v), save to disk) | | | | |<--Accepted(n,v)-| | | |<--Accepted(n,v)----------------| | |<--Accepted(n,v)------------------------------| | | MAJORITY accepted (n,v) ==> v is CHOSEN, forever. | Learners that see a majority of Accepted(n,v) now know: the answer is v.
"node-7". (1) P sends
Prepare(5.P) to all five. A1, A2, A3 reply Promise(5.P, none) — that is a majority of 3, none has accepted anything. (2) Since no prior value came back, P is free to use its own value, so it sends
Accept(5.P, "node-7"). (3) A1, A2, A3 had promised nothing higher than 5.P, so they accept and persist
(5.P, "node-7") and tell the learners. Three of five accepted — a majority — so "node-7" is now the chosen leader, irrevocably, even if P crashes one millisecond later.5.6 Why Paxos is SAFE — the heart of the algorithm
This is the part worth reading twice. Everything above exists to make one guarantee true: once a value could possibly have been chosen, every future proposal is forced to propose that same value. So the system can never choose two different values. Here is the chain of reasoning, built from two facts you already have.
Fact 1 — Majorities overlap. Suppose value v was chosen at proposal number n. By definition that means a majority of acceptors — call it set M1 — accepted (n, v).
Fact 2 — A later proposal must talk to a majority too. Now some proposer comes along later with a higher number n' (where n' > n). To get anywhere it must complete Phase 1: it must collect Promises from its own majority — call it set M2. Because any two majorities overlap, M1 and M2 share at least one acceptor. Call that shared acceptor X.
The punchline. Acceptor X is in M1, so it accepted (n, v). When the later proposer asks X in Phase 1, X reports back the value it accepted — (n, v) — inside its Promise. Now the Phase 2a rule kicks in: the later proposer is forced to use the highest-numbered accepted value it heard about. X just told it about v at number n. So the later proposer cannot use a value of its own — it must re-propose v. Therefore proposal n' also proposes v. The same argument applies to n'', n''', and so on forever. Every proposal numbered higher than n proposes v. No second value can ever be chosen.
The subtle bit that closes the last hole: how can the later proposer be sure it learned about a chosen value, if maybe only some of M1 had accepted by then? The answer is the promise. When X promised not to accept anything below n', it also locked the door: any proposal between n and n' can no longer slip a different value past X. So either the later proposer sees v directly, or no competing value could have reached a majority through X in the first place. The "promise" (raise the floor) and the "adopt the highest accepted value" rule work as a pair: the promise blocks the past from changing, and the adopt-rule forces the future to inherit it.
5.7 The liveness problem — dueling proposers and FLP
Paxos is always safe, but it is not always guaranteed to finish. The classic failure mode is dueling proposers (also called livelock — the system is busy but makes no progress). Picture two proposers, P1 and P2, each retrying with ever-higher numbers, perfectly out of step:
DUELING PROPOSERS (livelock) P1 and P2 over acceptors A1 A2 A3 P1: Prepare(1) ----------------------> A1 A2 A3 promise(1) P2: Prepare(2) ---------------> A1 A2 A3 promise(2) (raises floor to 2) P1: Accept(1, x) --------------------> A1 A2 A3 REJECT (already promised 2 > 1) P1: Prepare(3) --------> A1 A2 A3 promise(3) (raises floor to 3) P2: Accept(2, y) --------------------> A1 A2 A3 REJECT (already promised 3 > 2) P2: Prepare(4) --> A1 A2 A3 promise(4) P1: Accept(3, x) --------------------> A1 A2 A3 REJECT (already promised 4 > 3) ... and so on, forever. Each Prepare invalidates the other's pending Accept. No value is ever chosen, even though nothing crashed.
Each proposer's Phase 1 keeps raising the floor and knocking out the other's Phase 2 before it can complete. Nobody crashed, no message was lost — yet no progress is made. This is not a bug in Paxos; it is the FLP impossibility result from the foundations guide showing up in real life: in a fully asynchronous system (no bound on message delays, no perfect failure detection), no consensus algorithm can guarantee it always terminates. Paxos chooses to sacrifice guaranteed liveness to keep absolute safety. Safety is never violated; only progress can stall.
The standard fix has two parts:
- Elect a single distinguished proposer (a "leader"). Agree — informally, by a weak election that doesn't need to be perfect — that only one proposer normally issues proposals. With just one proposer, there is nobody to duel with, so Phase 1 and Phase 2 complete cleanly. Safety does not depend on the election being correct: if two leaders accidentally exist, Paxos is still safe (two values can never be chosen) — you only risk a temporary stall.
- Randomized / exponential backoff. When a proposer's attempt is rejected, it waits a random amount of time before retrying (and the wait grows on repeated failures). Randomness desynchronizes the duelists so one of them gets a clean window to finish both phases. This is the same trick Ethernet and Raft elections use to break symmetric ties.
A common optimization, Multi-Paxos, takes the leader idea further: a stable leader runs Phase 1 just once to claim a range of decisions, then only does the cheap Phase 2 for each new value. This makes the steady state one round-trip per decision — and it is essentially the shape Raft formalizes with a strong leader.
Common mistakes & misconceptions
Best practices
(counter, serverId). Never rely on wall-clock time alone for uniqueness or ordering.Section summary
- Paxos, from Lamport's Part-Time Parliament (1990) and Paxos Made Simple (2001), is the original, provably-correct consensus algorithm and the ancestor of nearly every modern one.
- It uses three roles — proposer (suggests), acceptor (votes and remembers durably), learner (finds out the result) — and one machine can play all three.
- The goal of single-decree Paxos is to get a majority of acceptors to agree on exactly one value, permanently and safely.
- Every attempt carries a unique, totally-ordered proposal number, typically built as
(counter, serverId)so no two ever collide. - Phase 1 (Prepare/Promise): proposer sends
Prepare(n); an acceptor promises to reject anything belownand reports back any value it already accepted. - Phase 2 (Accept/Accepted): with a majority of promises, the proposer sends
Accept(n, v), wherevis the highest already-accepted value it learned about — or its own value if there was none; acceptors accept unless they've promised something higher. A majority of accepts = chosen forever. - Safety rests on three things together: majorities always overlap, acceptors remember, and new proposers must adopt the highest already-accepted value — so the first value to reach a majority can never be overridden.
- Paxos can livelock with dueling proposers (the FLP impossibility made concrete); it sacrifices guaranteed liveness, never safety.
- The fix is a single distinguished/leader proposer plus randomized backoff; Multi-Paxos extends this to an efficient one-round-trip-per-decision log, the shape Raft later formalized.