Multi-Paxos, Raft vs Paxos & the Real World

By Pritesh Yadav 20 min read

So far you have seen consensus as a way to agree on one value. In the previous sections, basic Paxos (single-decree Paxos) agrees on a single decision: one slot, one chosen value, done. But real systems almost never want to agree on just one thing. A database wants to agree on a never-ending stream of operations: "set x=5", then "delete y", then "append to z", in a strict order that every replica applies identically. This section is about how the theory turns into working systems, and how you choose between them.

Let me define the central idea up front. A replicated log is an ordered list of commands, numbered 1, 2, 3, 4, ..., where every replica stores the same commands in the same order. Each numbered position is called a log slot (or log index). A state machine is just your application: it starts in some state and changes state by applying commands one at a time. The whole trick of practical consensus is called state machine replication (SMR): if every replica starts in the same state and applies the exact same sequence of commands in the exact same order, then every replica ends up in the exact same state. Consensus is the tool that makes all replicas agree on "what is the command in slot 1? slot 2? slot 3?" — agreeing on the log makes the whole system behave like one reliable machine.

Analogy: Think of a group of accountants, each keeping their own copy of the same ledger. If they all write down the exact same transactions in the exact same order, all their ledgers will match at the end — even if they never compare totals. Consensus is the rule that decides "what is transaction #1, #2, #3" so that nobody writes a different line on the same row. The state machine is the running balance; the log is the list of transactions.

6.1 From single-decree to a LOG: Multi-Paxos

What it is. Multi-Paxos is simply: run a separate instance of basic Paxos for each log slot. Slot 1 has its own Paxos run, slot 2 has its own Paxos run, and so on. Each slot independently chooses one value (one command), and once chosen it never changes. Stack the chosen values in slot order and you have your replicated log.

The problem it solves. Basic Paxos agrees on one value with two phases of message round-trips. If you naively ran full Paxos for every command, every single command would pay that full cost. That is wasteful, and it is also dangerous: if many nodes try to be proposer at the same time, they can keep interrupting each other with higher proposal numbers ("dueling proposers") and make no progress. Multi-Paxos fixes both with one idea: elect a stable leader and reuse it.

Quick recap of basic Paxos phases (covered in earlier sections — restated only so the optimization makes sense):

  • Phase 1 — Prepare / Promise. A proposer picks a proposal number (also called a ballot number — a unique, ever-increasing tag that says "how recent this attempt is") and sends Prepare(n) to the acceptors. An acceptor that has not seen anything higher replies Promise: "I won't accept anything older than n, and here is the most recent value I already accepted, if any." This phase is really about winning the right to propose and learning any value that might already be chosen.
  • Phase 2 — Accept / Accepted. The proposer sends Accept(n, value); a majority replying Accepted means the value is chosen.

How the stable-leader optimization works, step by step. Here is the key insight: Phase 1 does not mention any specific value or slot. It just establishes "I, proposer with ballot n, am now in charge, and here is what's already been accepted." So you can run Phase 1 once for an entire range of future slots, not per command.

  1. A node decides to become leader. It runs Phase 1 (Prepare) once with a ballot number n, but for all log slots from "next empty slot" onward. A majority of acceptors promise.
  2. Now this leader is established. For each new client command, it simply assigns the next free slot and sends only Phase 2 (Accept(n, slot, command)) to the acceptors. One round trip. No Phase 1.
  3. This continues for command after command — only Phase 2 each time — as long as the leader stays alive and no other node challenges it with a higher ballot.
  4. If the leader crashes or is suspected dead, some other node runs Phase 1 again with a higher ballot n+1, becomes the new leader, learns any half-finished slots from the promises, and resumes single-round-trip operation.

Why it matters. In the common, steady-state case (a healthy stable leader), committing a command costs just one round trip to a majority — roughly two message delays — instead of two full phases (about four message delays). That is nearly half the latency, and it removes dueling proposers because there is normally only one active proposer. Phase 1 becomes a rare event that happens only on leader change.

SINGLE-DECREE PAXOS (full protocol, every value pays for both phases)

 Proposer            Acceptors (majority)
    |  Prepare(n) ------>  |  |  |        Phase 1
    |  <----- Promise -----|  |  |        (win leadership + learn)
    |  Accept(n,v) ------> |  |  |        Phase 2
    |  <----- Accepted ----|  |  |        (value chosen)
    => ~4 message delays PER VALUE


MULTI-PAXOS WITH A STABLE LEADER (Phase 1 done once, then reused)

 Leader              Acceptors (majority)
    |  Prepare(n) ------> |  |  |   Phase 1  ── ONCE, for all future slots
    |  <----- Promise ----|  |  |
    -------------------------------------------------- leader is now stable
    |  Accept(n,slot1,A)->|  |  |   Phase 2 only   (1 round trip) -> slot 1 = A
    |  <---- Accepted ----|  |  |
    |  Accept(n,slot2,B)->|  |  |   Phase 2 only   (1 round trip) -> slot 2 = B
    |  <---- Accepted ----|  |  |
    |  Accept(n,slot3,C)->|  |  |   Phase 2 only   (1 round trip) -> slot 3 = C
    => ~2 message delays PER VALUE after the one-time setup
Example: A 3-node key-value store using Multi-Paxos. Node A wins Phase 1 with ballot 10 and becomes leader. A client sends "SET balance=100". A assigns it slot 1 and sends Accept(10, slot1, "SET balance=100"); B and C reply Accepted — that is a majority (2 of 3), so slot 1 is committed in one round trip. Next "WITHDRAW 30" goes into slot 2, again one round trip. A never re-runs Phase 1. If A then crashes, B runs Prepare(11) for slots from 3 onward, learns from the promises that slots 1 and 2 are already committed, fills any gaps, and continues as the new leader.

Common mistake: Thinking "Multi-Paxos" is a single, fully specified protocol you can copy from a paper. It is not. Lamport's "Paxos Made Simple" describes the leader/log idea in only a paragraph or two and leaves out the hard practical parts: how leader election actually works, how you fill gaps in the log, how you handle a node that fell behind, how you change membership. Every production Multi-Paxos system had to invent and verify these details itself — which is exactly the pain that motivated Raft.

6.2 Raft vs Paxos — a fair comparison

Raft (Diego Ongaro and John Ousterhout, 2014, "In Search of an Understandable Consensus Algorithm") was designed with an unusual primary goal: understandability. It provides the same guarantees and the same efficiency as Multi-Paxos — the paper explicitly says Raft "produces a result equivalent to (multi-)Paxos, and it is as efficient as Paxos." The difference is structure. Raft starts from a strong leader and a single log, and deliberately reduces the number of moving states you have to reason about.

Two terms you need before the comparison:

  • Strong leader. In Raft, log entries flow in one direction only: from the leader out to followers, never the other way. Clients talk to the leader; the leader is the single source of truth for log order. This is stricter than Multi-Paxos, where any node can in principle propose for a slot.
  • Term. Raft's name for a ballot/era. A term is a numbered period of time with at most one leader. Terms increase monotonically; each leader election starts a new term. Terms play the same role as Paxos ballot numbers: they let nodes detect stale leaders.

Raft has exactly two RPCs (remote procedure calls — a function call made over the network from one node to another):

  • RequestVote — sent by a candidate during an election. Key fields: term (the candidate's new term), candidateId, lastLogIndex and lastLogTerm (so voters only elect a candidate whose log is at least as up-to-date as their own — this is the safety rule that keeps committed entries safe).
  • AppendEntries — sent by the leader to replicate log entries and as an empty heartbeat to keep its leadership alive. Key fields: term, leaderId, prevLogIndex and prevLogTerm (the entry immediately before the new ones, used as a consistency check so a follower only accepts entries that line up with what it already has), entries[] (the new commands), and leaderCommit (how far the leader has committed).

Notice the mapping: Raft's RequestVote ≈ Paxos Phase 1 (win leadership for a term), and Raft's AppendEntries ≈ Multi-Paxos Phase 2 (commit a command in one round trip). They are the same machine wearing different clothes.

RAFT NORMAL OPERATION (one term, stable leader)

 Client     Leader (term 4)        Follower1   Follower2
   |  cmd ---> |                       |           |
   |          | AppendEntries(term=4, prevLogIndex=2,
   |          |   entries=[slot3:cmd], leaderCommit=2) -------> |  |
   |          | <---------- success (matched prevLog) --------- |  |
   |          |  majority acked  =>  slot 3 committed
   |  <-- ok -|  apply to state machine, tell followers via leaderCommit

 RAFT LOG as boxes (followers must match the leader exactly):
   index:   1        2        3        4
         +--------+--------+--------+--------+
 leader  | t1 A   | t1 B   | t4 C   | t4 D   |
         +--------+--------+--------+--------+
 (tN = term in which the entry was created; index = log slot)
DimensionMulti-PaxosRaft
Primary design goalProvable correctness; minimal coreUnderstandability and ease of correct implementation
Guarantees / efficiencySafe under crashes; ~1 round trip with stable leaderIdentical safety and efficiency (paper says "equivalent to multi-Paxos, as efficient as Paxos")
Leader modelOptional optimization. Leader is a performance trick; any node may propose. Weaker, more flexible leader.Mandatory strong leader. Entries flow leader→follower only. Simpler to reason about.
Log handlingSlots can be chosen out of order; gaps allowed and filled later. More flexible, more bookkeeping.Log is contiguous and append-only; a follower's log must match the leader exactly up to a point (Log Matching Property). No gaps.
Era / freshness tagBallot (proposal) numbersTerms (one leader per term)
How it's specifiedCore proven in papers; practical Multi-Paxos (log, leader, recovery) under-specified — each team fills gapsFully specified end-to-end (election, replication, safety, membership, log compaction) in one paper with a formal proof and reference impls
Membership changesDescribed abstractly (e.g. use a log entry to change the config); details vary by systemBuilt-in: joint consensus (overlap old+new config) or single-server add/remove. Spelled out.
Real implementationsChubby, Spanner — built by expert teams; famously hard to get exactly rightetcd, Consul, CockroachDB/TiKV, Kafka KRaft — many independent, interoperable implementations
Flexibility ceilingHigher — leaderless and disjoint-quorum variants (EPaxos, Flexible Paxos) extend the familyLower by design — the strong-leader constraint trades flexibility for clarity

The honest summary: Raft and Multi-Paxos give you the same correctness and the same common-case speed. Paxos is the more general, more flexible family but is notoriously easy to get subtly wrong because the practical pieces are under-specified. Raft constrains the design (strong leader, contiguous log) to make a complete, teachable, implementable spec — which is why so many recent systems pick it. "Raft is better" is too strong; "Raft is easier to implement correctly, Paxos is more flexible" is the fair statement.

Common mistake: Believing Raft is faster or "more correct" than Paxos. It is neither. The authors deliberately matched Paxos's guarantees and performance; the only intended improvement was understandability. If someone claims a raw throughput win, they are comparing two specific implementations, not the algorithms.

6.3 Real-world systems and what they use consensus FOR

SystemAlgorithmWhat it uses consensus for
etcdRaftA distributed key-value store for cluster configuration and coordination; it is the consistent store behind Kubernetes (all cluster state lives here).
HashiCorp ConsulRaftService discovery, health checks, and a consistent key-value store; Raft keeps the service catalog and config replicated.
CockroachDB / TiKVRaft (per data range)Distributed SQL databases. Data is split into ranges; each range is a small Raft group replicating its writes, so each shard agrees on its own ordered log.
Apache ZooKeeperZAB (Paxos-like atomic broadcast)Coordination service: locks, leader election, config, naming. ZAB (ZooKeeper Atomic Broadcast) is a leader-based protocol — similar in spirit to Multi-Paxos — that orders all updates through one leader.
Google ChubbyPaxosA lock service and small-file store used across Google for leader election and coordination; one of the first big production Paxos deployments.
Google SpannerPaxos (per tablet/shard)Globally distributed SQL database. Each data shard is replicated by its own Paxos group across data centers, giving consistent, ordered writes worldwide.
Apache Kafka (KRaft mode)RaftKafka 3.0+ replaced its ZooKeeper dependency with KRaft (Kafka Raft) to manage cluster metadata — topics, partitions, broker membership — internally via a Raft log.

Notice a pattern shared by ZAB, ZooKeeper, etcd, Chubby, and Consul: these are coordination services. Applications rarely run consensus directly; instead they lean on one of these systems for the few things that truly need agreement (who is the leader, what is the current config, who holds this lock) and keep their high-volume data path outside consensus.

Analogy: Consensus is like the town's official notary. You do not notarize every email you send — that would be unbearably slow and expensive. You notarize only the handful of documents where everyone must later agree the exact wording and order: the deed, the contract, the will. etcd/ZooKeeper/Consul are the notary's office, shared by the whole town, so each application doesn't have to hire its own.

6.4 Beyond the basics (advanced — intuition only)

EPaxos — leaderless, commutativity-aware Paxos

Egalitarian Paxos (EPaxos) removes the single leader entirely. The motivation: a single leader is a bottleneck and a single point of slowness — clients far from the leader pay a long round trip. EPaxos lets any replica act as coordinator for a command. The clever part uses commutativity: two commands commute if running them in either order gives the same result (e.g. "set x=1" and "set y=2" don't interfere, but "set x=1" and "set x=2" do). When concurrent commands commute, EPaxos commits them in just two message delays (the "fast path") without ordering them relative to each other. Only conflicting commands need extra work to agree on an order (a slower, classic-Paxos-style path). The intuition: don't waste time ordering things whose order doesn't matter. The cost is complexity — tracking dependency sets between commands is intricate and has historically been hard to get right.

Flexible Paxos — quorum intersection, revisited

Classic Paxos uses majority quorums because any two majorities overlap in at least one node, and that overlap is what carries information forward safely. Flexible Paxos (Howard, Malkhi, Spiegelman, 2016) proved a sharper truth: the only intersection that actually matters is between Phase 1 quorums and Phase 2 quorums — not within a phase. So your Phase 1 (leader election) quorums and Phase 2 (commit) quorums must overlap each other, but two Phase 2 quorums need not overlap. Practically, you can shrink the commit quorum (faster, cheaper writes) at the price of a larger election quorum (rarer, so it's fine). The intuition: majorities were a convenient sufficient condition, not a necessary one — you can trade quorum sizes between the two phases.

Byzantine fault tolerance & PBFT — consensus when nodes can LIE

Everything above assumes crash faults: a node either follows the protocol correctly or stops. A Byzantine fault is worse — a faulty node can do anything, including lie, send different messages to different peers, or actively try to corrupt agreement (a compromised or malicious node). PBFT (Practical Byzantine Fault Tolerance), by Castro and Liskov (1999), made this practical. To tolerate f Byzantine nodes it needs 3f+1 total replicas (e.g. tolerate 1 liar with 4 nodes), because honest nodes must out-vote liars even when liars send conflicting stories. Its normal path has three all-to-all phases — pre-prepare (leader proposes order), prepare (replicas cross-confirm they saw the same proposal), and commit (replicas confirm enough others are ready) — each needing about 2f+1 matching messages so the honest majority always intersects. This is far more message-heavy than Raft/Paxos, which is why BFT is reserved for settings where you genuinely don't trust the participants — most famously blockchains, where mutually distrusting parties must still agree on one ledger.

Common mistake: Reaching for Byzantine fault tolerance (PBFT, blockchain-style consensus) inside your own data center. If you run all the nodes yourself, they don't lie — they crash. Crash-fault consensus (Raft/Paxos) is the right tool, and it is dramatically cheaper. BFT pays a steep cost (more nodes, more messages, cryptography) to solve a problem you almost certainly don't have internally.

6.5 When do YOU actually need consensus?

Consensus is powerful but expensive, and most application features do not need it. Use this practical ladder:

  1. Do you need it at all? A single database with one primary already gives you a consistent, ordered source of truth for free. If one Postgres/MySQL primary (with replicas for read scaling or failover) handles your data, you do not need to run your own consensus protocol. Most CRUD apps live here happily.
  2. Do you need agreement on a small piece of critical metadata? Leader election ("who is the active worker?"), distributed locks, feature flags, service config, cluster membership — these genuinely need consensus. Use an off-the-shelf system: etcd, ZooKeeper, or Consul. They expose simple primitives (a consistent key-value store, locks, leases) and have absorbed years of bug-fixing.
  3. Do you need a consistent, sharded database across machines/regions? Use a system that builds consensus in for you — CockroachDB, Spanner, TiKV, etcd — rather than wiring raw Raft yourself.
Best practice: Do NOT roll your own consensus. This is the single most repeated piece of advice in the field. Even the experts who wrote production Paxos (Google's Chubby team) reported that turning the paper into correct, fault-injecting, real code was far harder than expected. Reach for a battle-tested library or service. Writing your own is a multi-year correctness project, not a sprint task.

The cost you are paying. Every committed decision in Raft/Multi-Paxos needs a round trip to a majority of nodes. That means: (1) you cannot go faster than the latency to the median node — across regions that's tens to hundreds of milliseconds per write; (2) you need an odd number of nodes (3 or 5 typically) so a majority exists; (3) the system keeps working only while a majority is reachable — lose the majority (e.g. 2 of 3 nodes down) and writes correctly stop rather than risk split-brain. This is the CP corner of CAP you met earlier, made concrete.

Example: A team wants "only one worker should process the nightly billing job." Tempting wrong answer: invent a custom heartbeat-and-flag scheme in your app database. Right answer: take a lease/lock from etcd or ZooKeeper — acquire("billing-leader", ttl=15s). Whoever holds it runs the job; if that node dies, the lease expires and another node acquires it. You consumed consensus through a one-line API instead of reimplementing leader election (and its dozen subtle failure cases) yourself.

Key takeaway: Multi-Paxos and Raft are two roads to the same destination — an agreed-upon replicated log, committed in one majority round trip in the steady state via a stable leader. Raft trades flexibility for a complete, understandable spec; Paxos is more general but easier to get wrong. In practice you should almost never implement either: pick etcd/ZooKeeper/Consul for coordination, or a consensus-backed database for data, and reserve consensus for the few decisions that truly require everyone to agree.

Common mistakes & misconceptions

Common mistake: Treating "Multi-Paxos" as a single, copy-pasteable algorithm. The paper sketches the leader/log idea but omits the hard practical machinery (election, gap filling, recovery, membership). Real systems had to design and verify all of that themselves.

Common mistake: Thinking Raft is fundamentally faster or safer than Paxos. They were engineered to give identical guarantees and identical common-case performance. The only deliberate difference is understandability.

Common mistake: Re-running Paxos Phase 1 for every command. The whole point of the stable-leader optimization is to run Phase 1 once per leadership term and then use only Phase 2 (one round trip) for each subsequent command.

Common mistake: Using Byzantine fault tolerance (PBFT/blockchain consensus) for internal infrastructure. Your own nodes crash, they don't lie — crash-fault consensus (Raft/Paxos) is correct and far cheaper. BFT is for mutually distrusting parties.

Common mistake: Rolling your own consensus library for a production system. Even expert teams found turning the papers into correct, fault-tested code brutally hard. Use a proven system instead.

Common mistake: Putting your high-volume data path through consensus when you don't need to. Consensus costs a majority round trip per write. Most apps need agreement only on a tiny slice of critical metadata (leader, lock, config), not on every request.

Best practices

Best practice: Use an off-the-shelf coordination service (etcd, ZooKeeper, Consul) for leader election, locks, membership, and config. Consume consensus through a simple lease/key-value API rather than reimplementing it.

Best practice: If you only need a consistent ordered source of truth, a single database primary (with replicas) is often enough. Reach for distributed consensus only when one primary genuinely cannot satisfy your availability or geographic needs.

Best practice: Run an odd number of voting nodes (3 or 5). It maximizes fault tolerance per node and avoids ties when forming a majority. Five nodes tolerate two failures; three tolerate one.

Best practice: Budget for the latency of a majority round trip, especially across regions. Keep latency-sensitive, high-throughput work outside the consensus path; put only the must-agree decisions inside it.

Best practice: When you do need a consensus-backed database, pick one that builds it in (CockroachDB, Spanner, TiKV, etcd) so the membership-change, recovery, and log-compaction details are handled and tested for you.

Section summary

  • Replicated log + state machine replication is the goal: every replica applies the same commands in the same order, so they all reach the same state.
  • Multi-Paxos = run basic Paxos per log slot; the key optimization is electing a stable leader so Phase 1 runs once and each command then commits in a single Phase-2 round trip.
  • Raft matches Multi-Paxos's guarantees and efficiency but is structured for understandability: a strong leader, a contiguous append-only log, terms, and just two RPCs (RequestVote, AppendEntries).
  • Raft's RequestVote ≈ Paxos Phase 1; Raft's AppendEntries ≈ Multi-Paxos Phase 2 — the same machine, different presentation.
  • Fair verdict: same correctness and speed; Raft is easier to implement correctly, Paxos is more flexible and easier to get subtly wrong.
  • Real systems: Raft — etcd, Consul, CockroachDB/TiKV, Kafka KRaft; Paxos/ZAB — Chubby, Spanner, ZooKeeper. Most are coordination services or sharded databases.
  • Advanced variants: EPaxos (leaderless, fast path for commutative commands), Flexible Paxos (only Phase-1/Phase-2 quorums must intersect), PBFT (tolerates lying nodes with 3f+1 replicas and three phases — used in blockchains).
  • Use crash-fault consensus (Raft/Paxos) internally; reserve Byzantine fault tolerance for mutually distrusting parties.
  • Practical guidance: a single DB primary is often enough; for true agreement use etcd/ZooKeeper/Consul; do not roll your own.
  • The price of consensus is a majority round trip per decision and the need for a reachable majority — writes correctly stop rather than risk split-brain.

Continue reading