Multi-Paxos, Raft vs Paxos & the Real World
So far you have seen consensus as a way to agree on one value. In the previous sections, basic Paxos (single-decree Paxos) agrees on a single decision: one slot, one chosen value, done. But real systems almost never want to agree on just one thing. A database wants to agree on a never-ending stream of operations: "set x=5", then "delete y", then "append to z", in a strict order that every replica applies identically. This section is about how the theory turns into working systems, and how you choose between them.
Let me define the central idea up front. A replicated log is an ordered list of commands, numbered 1, 2, 3, 4, ..., where every replica stores the same commands in the same order. Each numbered position is called a log slot (or log index). A state machine is just your application: it starts in some state and changes state by applying commands one at a time. The whole trick of practical consensus is called state machine replication (SMR): if every replica starts in the same state and applies the exact same sequence of commands in the exact same order, then every replica ends up in the exact same state. Consensus is the tool that makes all replicas agree on "what is the command in slot 1? slot 2? slot 3?" — agreeing on the log makes the whole system behave like one reliable machine.
6.1 From single-decree to a LOG: Multi-Paxos
What it is. Multi-Paxos is simply: run a separate instance of basic Paxos for each log slot. Slot 1 has its own Paxos run, slot 2 has its own Paxos run, and so on. Each slot independently chooses one value (one command), and once chosen it never changes. Stack the chosen values in slot order and you have your replicated log.
The problem it solves. Basic Paxos agrees on one value with two phases of message round-trips. If you naively ran full Paxos for every command, every single command would pay that full cost. That is wasteful, and it is also dangerous: if many nodes try to be proposer at the same time, they can keep interrupting each other with higher proposal numbers ("dueling proposers") and make no progress. Multi-Paxos fixes both with one idea: elect a stable leader and reuse it.
Quick recap of basic Paxos phases (covered in earlier sections — restated only so the optimization makes sense):
- Phase 1 — Prepare / Promise. A proposer picks a proposal number (also called a ballot number — a unique, ever-increasing tag that says "how recent this attempt is") and sends
Prepare(n)to the acceptors. An acceptor that has not seen anything higher repliesPromise: "I won't accept anything older thann, and here is the most recent value I already accepted, if any." This phase is really about winning the right to propose and learning any value that might already be chosen. - Phase 2 — Accept / Accepted. The proposer sends
Accept(n, value); a majority replyingAcceptedmeans the value is chosen.
How the stable-leader optimization works, step by step. Here is the key insight: Phase 1 does not mention any specific value or slot. It just establishes "I, proposer with ballot n, am now in charge, and here is what's already been accepted." So you can run Phase 1 once for an entire range of future slots, not per command.
- A node decides to become leader. It runs Phase 1 (
Prepare) once with a ballot numbern, but for all log slots from "next empty slot" onward. A majority of acceptors promise. - Now this leader is established. For each new client command, it simply assigns the next free slot and sends only Phase 2 (
Accept(n, slot, command)) to the acceptors. One round trip. No Phase 1. - This continues for command after command — only Phase 2 each time — as long as the leader stays alive and no other node challenges it with a higher ballot.
- If the leader crashes or is suspected dead, some other node runs Phase 1 again with a higher ballot
n+1, becomes the new leader, learns any half-finished slots from the promises, and resumes single-round-trip operation.
Why it matters. In the common, steady-state case (a healthy stable leader), committing a command costs just one round trip to a majority — roughly two message delays — instead of two full phases (about four message delays). That is nearly half the latency, and it removes dueling proposers because there is normally only one active proposer. Phase 1 becomes a rare event that happens only on leader change.
SINGLE-DECREE PAXOS (full protocol, every value pays for both phases)
Proposer Acceptors (majority)
| Prepare(n) ------> | | | Phase 1
| <----- Promise -----| | | (win leadership + learn)
| Accept(n,v) ------> | | | Phase 2
| <----- Accepted ----| | | (value chosen)
=> ~4 message delays PER VALUE
MULTI-PAXOS WITH A STABLE LEADER (Phase 1 done once, then reused)
Leader Acceptors (majority)
| Prepare(n) ------> | | | Phase 1 ── ONCE, for all future slots
| <----- Promise ----| | |
-------------------------------------------------- leader is now stable
| Accept(n,slot1,A)->| | | Phase 2 only (1 round trip) -> slot 1 = A
| <---- Accepted ----| | |
| Accept(n,slot2,B)->| | | Phase 2 only (1 round trip) -> slot 2 = B
| <---- Accepted ----| | |
| Accept(n,slot3,C)->| | | Phase 2 only (1 round trip) -> slot 3 = C
=> ~2 message delays PER VALUE after the one-time setup
Accept(10, slot1, "SET balance=100"); B and C reply Accepted — that is a majority (2 of 3), so slot 1 is committed in one round trip. Next "WITHDRAW 30" goes into slot 2, again one round trip. A never re-runs Phase 1. If A then crashes, B runs Prepare(11) for slots from 3 onward, learns from the promises that slots 1 and 2 are already committed, fills any gaps, and continues as the new leader.6.2 Raft vs Paxos — a fair comparison
Raft (Diego Ongaro and John Ousterhout, 2014, "In Search of an Understandable Consensus Algorithm") was designed with an unusual primary goal: understandability. It provides the same guarantees and the same efficiency as Multi-Paxos — the paper explicitly says Raft "produces a result equivalent to (multi-)Paxos, and it is as efficient as Paxos." The difference is structure. Raft starts from a strong leader and a single log, and deliberately reduces the number of moving states you have to reason about.
Two terms you need before the comparison:
- Strong leader. In Raft, log entries flow in one direction only: from the leader out to followers, never the other way. Clients talk to the leader; the leader is the single source of truth for log order. This is stricter than Multi-Paxos, where any node can in principle propose for a slot.
- Term. Raft's name for a ballot/era. A term is a numbered period of time with at most one leader. Terms increase monotonically; each leader election starts a new term. Terms play the same role as Paxos ballot numbers: they let nodes detect stale leaders.
Raft has exactly two RPCs (remote procedure calls — a function call made over the network from one node to another):
RequestVote— sent by a candidate during an election. Key fields:term(the candidate's new term),candidateId,lastLogIndexandlastLogTerm(so voters only elect a candidate whose log is at least as up-to-date as their own — this is the safety rule that keeps committed entries safe).AppendEntries— sent by the leader to replicate log entries and as an empty heartbeat to keep its leadership alive. Key fields:term,leaderId,prevLogIndexandprevLogTerm(the entry immediately before the new ones, used as a consistency check so a follower only accepts entries that line up with what it already has),entries[](the new commands), andleaderCommit(how far the leader has committed).
Notice the mapping: Raft's RequestVote ≈ Paxos Phase 1 (win leadership for a term), and Raft's AppendEntries ≈ Multi-Paxos Phase 2 (commit a command in one round trip). They are the same machine wearing different clothes.
RAFT NORMAL OPERATION (one term, stable leader)
Client Leader (term 4) Follower1 Follower2
| cmd ---> | | |
| | AppendEntries(term=4, prevLogIndex=2,
| | entries=[slot3:cmd], leaderCommit=2) -------> | |
| | <---------- success (matched prevLog) --------- | |
| | majority acked => slot 3 committed
| <-- ok -| apply to state machine, tell followers via leaderCommit
RAFT LOG as boxes (followers must match the leader exactly):
index: 1 2 3 4
+--------+--------+--------+--------+
leader | t1 A | t1 B | t4 C | t4 D |
+--------+--------+--------+--------+
(tN = term in which the entry was created; index = log slot)
| Dimension | Multi-Paxos | Raft |
|---|---|---|
| Primary design goal | Provable correctness; minimal core | Understandability and ease of correct implementation |
| Guarantees / efficiency | Safe under crashes; ~1 round trip with stable leader | Identical safety and efficiency (paper says "equivalent to multi-Paxos, as efficient as Paxos") |
| Leader model | Optional optimization. Leader is a performance trick; any node may propose. Weaker, more flexible leader. | Mandatory strong leader. Entries flow leader→follower only. Simpler to reason about. |
| Log handling | Slots can be chosen out of order; gaps allowed and filled later. More flexible, more bookkeeping. | Log is contiguous and append-only; a follower's log must match the leader exactly up to a point (Log Matching Property). No gaps. |
| Era / freshness tag | Ballot (proposal) numbers | Terms (one leader per term) |
| How it's specified | Core proven in papers; practical Multi-Paxos (log, leader, recovery) under-specified — each team fills gaps | Fully specified end-to-end (election, replication, safety, membership, log compaction) in one paper with a formal proof and reference impls |
| Membership changes | Described abstractly (e.g. use a log entry to change the config); details vary by system | Built-in: joint consensus (overlap old+new config) or single-server add/remove. Spelled out. |
| Real implementations | Chubby, Spanner — built by expert teams; famously hard to get exactly right | etcd, Consul, CockroachDB/TiKV, Kafka KRaft — many independent, interoperable implementations |
| Flexibility ceiling | Higher — leaderless and disjoint-quorum variants (EPaxos, Flexible Paxos) extend the family | Lower by design — the strong-leader constraint trades flexibility for clarity |
The honest summary: Raft and Multi-Paxos give you the same correctness and the same common-case speed. Paxos is the more general, more flexible family but is notoriously easy to get subtly wrong because the practical pieces are under-specified. Raft constrains the design (strong leader, contiguous log) to make a complete, teachable, implementable spec — which is why so many recent systems pick it. "Raft is better" is too strong; "Raft is easier to implement correctly, Paxos is more flexible" is the fair statement.
6.3 Real-world systems and what they use consensus FOR
| System | Algorithm | What it uses consensus for |
|---|---|---|
| etcd | Raft | A distributed key-value store for cluster configuration and coordination; it is the consistent store behind Kubernetes (all cluster state lives here). |
| HashiCorp Consul | Raft | Service discovery, health checks, and a consistent key-value store; Raft keeps the service catalog and config replicated. |
| CockroachDB / TiKV | Raft (per data range) | Distributed SQL databases. Data is split into ranges; each range is a small Raft group replicating its writes, so each shard agrees on its own ordered log. |
| Apache ZooKeeper | ZAB (Paxos-like atomic broadcast) | Coordination service: locks, leader election, config, naming. ZAB (ZooKeeper Atomic Broadcast) is a leader-based protocol — similar in spirit to Multi-Paxos — that orders all updates through one leader. |
| Google Chubby | Paxos | A lock service and small-file store used across Google for leader election and coordination; one of the first big production Paxos deployments. |
| Google Spanner | Paxos (per tablet/shard) | Globally distributed SQL database. Each data shard is replicated by its own Paxos group across data centers, giving consistent, ordered writes worldwide. |
| Apache Kafka (KRaft mode) | Raft | Kafka 3.0+ replaced its ZooKeeper dependency with KRaft (Kafka Raft) to manage cluster metadata — topics, partitions, broker membership — internally via a Raft log. |
Notice a pattern shared by ZAB, ZooKeeper, etcd, Chubby, and Consul: these are coordination services. Applications rarely run consensus directly; instead they lean on one of these systems for the few things that truly need agreement (who is the leader, what is the current config, who holds this lock) and keep their high-volume data path outside consensus.
6.4 Beyond the basics (advanced — intuition only)
EPaxos — leaderless, commutativity-aware Paxos
Egalitarian Paxos (EPaxos) removes the single leader entirely. The motivation: a single leader is a bottleneck and a single point of slowness — clients far from the leader pay a long round trip. EPaxos lets any replica act as coordinator for a command. The clever part uses commutativity: two commands commute if running them in either order gives the same result (e.g. "set x=1" and "set y=2" don't interfere, but "set x=1" and "set x=2" do). When concurrent commands commute, EPaxos commits them in just two message delays (the "fast path") without ordering them relative to each other. Only conflicting commands need extra work to agree on an order (a slower, classic-Paxos-style path). The intuition: don't waste time ordering things whose order doesn't matter. The cost is complexity — tracking dependency sets between commands is intricate and has historically been hard to get right.
Flexible Paxos — quorum intersection, revisited
Classic Paxos uses majority quorums because any two majorities overlap in at least one node, and that overlap is what carries information forward safely. Flexible Paxos (Howard, Malkhi, Spiegelman, 2016) proved a sharper truth: the only intersection that actually matters is between Phase 1 quorums and Phase 2 quorums — not within a phase. So your Phase 1 (leader election) quorums and Phase 2 (commit) quorums must overlap each other, but two Phase 2 quorums need not overlap. Practically, you can shrink the commit quorum (faster, cheaper writes) at the price of a larger election quorum (rarer, so it's fine). The intuition: majorities were a convenient sufficient condition, not a necessary one — you can trade quorum sizes between the two phases.
Byzantine fault tolerance & PBFT — consensus when nodes can LIE
Everything above assumes crash faults: a node either follows the protocol correctly or stops. A Byzantine fault is worse — a faulty node can do anything, including lie, send different messages to different peers, or actively try to corrupt agreement (a compromised or malicious node). PBFT (Practical Byzantine Fault Tolerance), by Castro and Liskov (1999), made this practical. To tolerate f Byzantine nodes it needs 3f+1 total replicas (e.g. tolerate 1 liar with 4 nodes), because honest nodes must out-vote liars even when liars send conflicting stories. Its normal path has three all-to-all phases — pre-prepare (leader proposes order), prepare (replicas cross-confirm they saw the same proposal), and commit (replicas confirm enough others are ready) — each needing about 2f+1 matching messages so the honest majority always intersects. This is far more message-heavy than Raft/Paxos, which is why BFT is reserved for settings where you genuinely don't trust the participants — most famously blockchains, where mutually distrusting parties must still agree on one ledger.
6.5 When do YOU actually need consensus?
Consensus is powerful but expensive, and most application features do not need it. Use this practical ladder:
- Do you need it at all? A single database with one primary already gives you a consistent, ordered source of truth for free. If one Postgres/MySQL primary (with replicas for read scaling or failover) handles your data, you do not need to run your own consensus protocol. Most CRUD apps live here happily.
- Do you need agreement on a small piece of critical metadata? Leader election ("who is the active worker?"), distributed locks, feature flags, service config, cluster membership — these genuinely need consensus. Use an off-the-shelf system: etcd, ZooKeeper, or Consul. They expose simple primitives (a consistent key-value store, locks, leases) and have absorbed years of bug-fixing.
- Do you need a consistent, sharded database across machines/regions? Use a system that builds consensus in for you — CockroachDB, Spanner, TiKV, etcd — rather than wiring raw Raft yourself.
The cost you are paying. Every committed decision in Raft/Multi-Paxos needs a round trip to a majority of nodes. That means: (1) you cannot go faster than the latency to the median node — across regions that's tens to hundreds of milliseconds per write; (2) you need an odd number of nodes (3 or 5 typically) so a majority exists; (3) the system keeps working only while a majority is reachable — lose the majority (e.g. 2 of 3 nodes down) and writes correctly stop rather than risk split-brain. This is the CP corner of CAP you met earlier, made concrete.
acquire("billing-leader", ttl=15s). Whoever holds it runs the job; if that node dies, the lease expires and another node acquires it. You consumed consensus through a one-line API instead of reimplementing leader election (and its dozen subtle failure cases) yourself.Common mistakes & misconceptions
Best practices
Section summary
- Replicated log + state machine replication is the goal: every replica applies the same commands in the same order, so they all reach the same state.
- Multi-Paxos = run basic Paxos per log slot; the key optimization is electing a stable leader so Phase 1 runs once and each command then commits in a single Phase-2 round trip.
- Raft matches Multi-Paxos's guarantees and efficiency but is structured for understandability: a strong leader, a contiguous append-only log, terms, and just two RPCs (
RequestVote,AppendEntries). - Raft's
RequestVote≈ Paxos Phase 1; Raft'sAppendEntries≈ Multi-Paxos Phase 2 — the same machine, different presentation. - Fair verdict: same correctness and speed; Raft is easier to implement correctly, Paxos is more flexible and easier to get subtly wrong.
- Real systems: Raft — etcd, Consul, CockroachDB/TiKV, Kafka KRaft; Paxos/ZAB — Chubby, Spanner, ZooKeeper. Most are coordination services or sharded databases.
- Advanced variants: EPaxos (leaderless, fast path for commutative commands), Flexible Paxos (only Phase-1/Phase-2 quorums must intersect), PBFT (tolerates lying nodes with
3f+1replicas and three phases — used in blockchains). - Use crash-fault consensus (Raft/Paxos) internally; reserve Byzantine fault tolerance for mutually distrusting parties.
- Practical guidance: a single DB primary is often enough; for true agreement use etcd/ZooKeeper/Consul; do not roll your own.
- The price of consensus is a majority round trip per decision and the need for a reachable majority — writes correctly stop rather than risk split-brain.