Replicated State Machines & the Log

By Pritesh Yadav 25 min read

This section gives you the one mental model that every practical consensus system — Raft, Paxos, ZooKeeper's ZAB, etcd, Consul, Google's Chubby and Spanner — is built on. If you understand this section deeply, the later sections on Raft and Paxos become "two different ways to do the same thing." Almost everything those protocols do is in service of the idea you are about to learn.

The idea has a name: the Replicated State Machine, usually shortened to RSM. Let's build it up slowly.

2.1 What is a "state machine"? (the plain version)

A state machine is just a fancy name for a thing that has some current state (some data) and a fixed set of commands that change that state. You feed it a command, it updates its state, and it may give back a result. That's it.

The crucial property we care about is this: a state machine is deterministic. Deterministic means "no randomness, no surprises" — given the same starting state and the same command, it always produces the exact same new state and the exact same result. Every single time. No reading the clock, no random numbers, no peeking at the network — just (old state + command) → (new state + result).

Example: Think of a tiny in-memory key-value store (a "KV store") — basically a dictionary. Its state is a map like { "x": 1, "y": 5 }. Its commands are things like SET x 10, SET y 7, DELETE x, INCR y. Start it empty, apply SET x 1 then SET x 2 then INCR x, and you will always end with { "x": 3 }. The result never depends on luck — only on the commands and the order they arrive in.

2.2 The core insight: same start + same commands + same order = same end

Now copy that KV store onto five different servers. Call each copy a replica (you already know this word from the foundations guide — it's an independent server holding its own full copy of the data). We want all five replicas to hold identical data, forever, even though servers crash, messages get lost, and there is no global clock.

Here is the whole trick, and it is almost embarrassingly simple:

Key takeaway: If every replica (1) starts in the same initial state, and (2) applies the same commands in the same order, then because each replica is deterministic, they all pass through exactly the same sequence of states and end up identical. No replica ever needs to copy another's data; they only need to agree on the list of commands.

Stare at that for a second, because it changes the whole problem. We started with a hard-sounding goal — "keep five copies of a database in sync across an unreliable network" — and we just reduced it to a different goal: "make all five servers agree on one single, ordered list of commands." If they agree on the list, identical data falls out for free.

That ordered list of commands has a name: the log. And "make everyone agree on the ordered log" is exactly what a consensus algorithm does. So:

        THE BIG REDUCTION

  "Keep N database replicas identical"        (hard, vague)
                  │
                  │  because every replica is a
                  │  DETERMINISTIC state machine
                  ▼
  "Make all N replicas agree on ONE ordered   (precise!)
   list of commands — the LOG"
                  │
                  │  this exact thing is...
                  ▼
            C O N S E N S U S
Analogy: Imagine five accountants in five different cities, each keeping their own copy of the same ledger by hand. You do not fax them photos of your ledger pages to keep them in sync — that's slow and error-prone. Instead, every accountant works from the same numbered list of instructions: "#1: deposit $100. #2: withdraw $30. #3: transfer $20…". As long as they all follow the same numbered instructions in the same order, and each does the arithmetic the same way (they're deterministic), their ledgers are guaranteed to match — without anyone ever comparing ledgers. Agreeing on the numbered instruction list is the only hard part. That list is the log.

2.3 The replicated log, drawn as boxes

The log is an append-only list of entries. "Append-only" means you only ever add to the end — you never insert in the middle and never edit a committed entry. (Followers can have wrong, uncommitted entries overwritten, as we'll see — but agreed-upon history is immutable.)

Each log entry carries three things:

  • An index — the entry's position in the list: 1, 2, 3, 4, … Think of it as the line number. It defines the order.
  • A command — the actual instruction for the state machine, e.g. SET x 3.
  • A term (Raft's word) / epoch (ZooKeeper's word) / ballot (Paxos's word) — a number telling you "which leadership period created this entry." We'll explain terms fully in §2.5; for now just know each entry is stamped with one.

Here is one server's log drawn as boxes. Top number = term, bottom = command. The index runs left to right:

 index:    1        2        3        4        5
        ┌────────┬────────┬────────┬────────┬────────┐
        │ term 1 │ term 1 │ term 2 │ term 3 │ term 3 │
 LOG →  │ SET x 1│ SET y 5│ INCR x │ DEL y  │ SET z 9│
        └────────┴────────┴────────┴────────┴────────┘
                                    ▲
                                 commitIndex = 4
                                 (entries 1–4 are COMMITTED & safe;
                                  entry 5 is only proposed, not yet safe)

2.4 "Committed" vs "applied" — two different milestones

This distinction trips people up constantly, so we'll be precise. An entry passes through milestones on its way to being real:

  1. Appended — the entry exists in some server's log, but might still be lost or overwritten. Not safe yet.
  2. Committed — the entry has been safely replicated (stored) on a majority of servers. "Majority" means more than half — for 5 servers, at least 3. Once committed, the entry is permanent: it can never be lost or changed, even if servers crash, because any future majority must overlap with this majority on at least one server that has the entry. Committed = "this decision is final."
  3. Applied — the command has actually been run against the local state machine, changing the data (the KV map). Applied = "this decision has taken effect on my copy."

Two pieces of per-server state track these milestones (the real field names from Raft):

  • commitIndex — the highest index known to be committed.
  • lastApplied — the highest index this server has actually applied to its state machine.

The rule every server runs in a loop is simply: whenever commitIndex > lastApplied, increment lastApplied and apply that next log entry to the state machine. So "applied" always chases "committed." Commit is the agreement; apply is the side effect.

Common mistake: Thinking "committed" and "applied" are the same thing. They are not. An entry becomes committed the instant a majority has it stored — a purely logical fact about replication. It becomes applied later, when each server gets around to running it against its data. A follower can know entry 4 is committed (it saw a higher leaderCommit from the leader) but not yet have applied it. Commit decides "this is final and ordered"; apply decides "my data now reflects it." Only committed entries may ever be applied — never an uncommitted one, because an uncommitted entry could still be thrown away.
Key takeaway: Commit is about agreement and durability (a majority stored it → it's final). Apply is about effect (run it on the state machine → the data changes). Commit comes first and triggers apply. commitIndex tracks the first; lastApplied tracks the second; lastApplied never gets ahead of commitIndex.

2.5 Term / epoch / ballot numbers — a logical clock for leadership

Recall from the foundations guide: there is no global clock, so we can't order events by wall-clock time. We need a logical way to say "this happened in a later leadership period than that." That's what the term number is.

A term (Raft) — equivalently an epoch (ZAB) or ballot number (Paxos) — is a monotonically increasing integer that labels each period of leadership. "Monotonically increasing" just means it only ever goes up, never down: 1, 2, 3, 4, … Every time the cluster holds a new election to pick a leader, it bumps the number. A new election = a new term.

   TIME (logical, not wall-clock) ───────────────────────────────►

   ┌───── term 1 ─────┐┌─ term 2 ─┐┌──────── term 3 ────────┐
   │ leader = S1      ││ election ││ leader = S4            │
   │ (normal work)    ││ (no ldr) ││ (normal work)          │
   └──────────────────┘└──────────┘└────────────────────────┘
       term goes 1 ──────────► 2 ──────────► 3   (only ever UP)

   Some terms have a leader; some terms are just a failed election
   (split vote) with no leader — then the cluster moves to the next term.

Why does a single ever-increasing number matter so much? Because it lets every server settle disputes locally with one comparison, and it cleanly evicts stale leaders. Two rules do almost all the work:

  • Higher term always wins. Every message (every RPC — "Remote Procedure Call," a function call sent over the network to another server) carries its sender's term. When a server receives a message with a term higher than its own currentTerm, it adopts that higher term and immediately steps down to being a plain follower. When a server receives a message with a term lower than its own, it rejects it — "you're out of date, ignore me."
  • One vote per term. A server stores who it voted for this term in a field called votedFor, so it can never vote twice in the same term. Combined with "majority needed to win," this guarantees at most one leader per term — two candidates can't both collect a majority from the same set of voters in the same term.

This is precisely how the system avoids the nightmare of two leaders clashing. Suppose an old leader from term 2 was network-partitioned (cut off) for a while, and meanwhile the rest elected a new leader in term 3. When the partition heals and the old term-2 leader tries to push entries, every healthy server sees "term 2 < my term 3" and rejects it. The old leader, on hearing a reply carrying term 3, sees the higher number and steps down. No corruption — the higher term simply wins.

Analogy: Term numbers are like the regnal numbers of monarchs — "Henry I, Henry II, Henry III." Everyone obeys the highest-numbered monarch they've heard of. If a deposed king (Henry II) wanders back into the throne room issuing orders, the guards just say "Sorry, we serve Henry III now" and ignore him. The instant Henry II hears the number "III," he knows he's been superseded and stands down. The number alone — not a clock, not a vote recount — resolves who is legitimately in charge.
Key takeaway: The term/epoch/ballot number is the logical clock for leadership. It is monotonic (only goes up), it stamps every entry and every message, "higher term wins / lower term is rejected," and "one vote per term + majority" gives at most one leader per term. This single integer is what makes a stale, returning leader harmless instead of catastrophic.

2.6 Leader-based vs leaderless — why nearly everyone elects a leader

There are two broad ways to agree on the log:

ApproachHow it worksTrade-offReal systems
Leader-basedThe cluster elects one server as leader. The leader alone decides the order of entries and pushes them to followers. All writes funnel through the leader.Simple to reason about; one server sequences everything, so the order is obvious. Downside: the leader is a bottleneck and a single point of (re-elected) failure — if it dies you pause for an election.Raft, Multi-Paxos (with a stable leader), ZooKeeper/ZAB, etcd, Consul, Chubby
LeaderlessNo fixed leader; any server can propose, and servers run a voting round (or several) per individual log slot to agree on what goes there.No single bottleneck and no election pause. Downside: dueling proposers — two servers proposing for the same slot can keep interrupting each other, causing extra rounds and stalls (livelock).Basic (single-decree) Paxos; Egalitarian Paxos (EPaxos)

Why do almost all production systems pick a leader? Because once you have a single leader, the order of the log is trivially decided — it's just "whatever order the leader received the requests in." There is no argument about sequencing, because one server is doing all the sequencing. This makes the system far easier to understand, debug, and reason about, and it makes the common case fast: a stable leader can skip the expensive "who's in charge?" negotiation and just stream entries. (This is exactly the optimization Multi-Paxos makes over basic Paxos — run the leader-election "prepare" phase once, then reuse that leadership to commit many log entries with a single round-trip each. Raft bakes a strong leader in from the start.)

2.7 How a client actually talks to the system

Now put yourself in the shoes of a client — your web application that wants to write data. The normal happy path in a leader-based system is:

  CLIENT          LEADER (S1)            FOLLOWERS (S2, S3, S4, S5)
    │                 │                          │   │   │   │
    │  SET x 3        │                          │   │   │   │
    ├────────────────►│  1. append to own log    │   │   │   │
    │                 │  (index 5, term 3)       │   │   │   │
    │                 │                          │   │   │   │
    │                 │  2. AppendEntries RPC    │   │   │   │
    │                 ├─────────────────────────►│   │   │   │   (replicate)
    │                 ├─────────────────────────────►│   │   │
    │                 ├─────────────────────────────────►│   │
    │                 │◄───── ACK ───────────────┤   │   │   │
    │                 │◄───── ACK ───────────────────┤   │   │
    │                 │                          │   │   │   │
    │                 │  3. majority (3/5) stored │   │   │   │
    │                 │     → COMMIT index 5      │   │   │   │
    │                 │  4. apply to state machine│   │   │   │
    │   result: OK    │                          │   │   │   │
    │◄────────────────┤  5. reply to client      │   │   │   │
    │                 │                          │   │   │   │
  1. The client sends its command to the leader. (If it accidentally contacts a follower, the follower redirects it to the leader.)
  2. The leader appends the command to its own log as a new entry (next index, current term).
  3. The leader replicates the entry to followers via the AppendEntries RPC (covered in §2.8).
  4. Once a majority have stored it, the leader marks it committed and applies it to its own state machine.
  5. The leader replies to the client with the result. Followers learn the entry is committed from the leader's next message and apply it too.

Why clients need idempotency / de-duplication

Here's a subtle but critical problem. Suppose the leader commits your INCR x, applies it, and then crashes before the reply reaches you. Your client times out and retries. A new leader is elected — and now your INCR x risks being applied twice, turning a +1 into a +2. On a replicated state machine, a double-apply corrupts the data on every replica identically, so you can't even detect it by comparing replicas.

The fix is idempotency via de-duplication. Idempotent means "applying it twice has the same effect as applying it once." The standard technique:

  • The client opens a session and tags every command with a unique, increasing serial number (a request id) — e.g. { requestId: 1742, cmd: "INCR x" }.
  • The state machine remembers, per client, the highest serial number it has already applied and the result it returned.
  • If a command arrives whose serial number was already applied, the system does not re-run it — it just returns the remembered result.

This turns "at-least-once delivery" (the network may deliver your retry more than once) into "exactly-once effect" (the command changes state only once). This is what people loosely call "exactly-once" — it's really at-least-once delivery + idempotent processing.

Analogy: Online payment links use an "idempotency key." If your browser stutters and re-submits the payment form, the bank sees the same key it already processed and replies "already done — here's the original receipt" instead of charging your card twice. The serial number in a consensus client session is exactly that idempotency key, applied to log commands.
Common mistake: Assuming that because consensus gives you a single agreed log, you automatically get exactly-once execution. You don't — not without client de-duplication. Consensus guarantees the log is the same everywhere; it does not guarantee your retried request appears in that log only once. Two copies of the same retried command are two distinct, validly-committed entries unless you tag requests with ids and have the state machine skip duplicates.

2.8 The two RPCs that move the log (Raft's concrete fields)

To make this real, here are the exact fields of the two messages a leader-based system uses. We use Raft's names because they're the clearest; Paxos and ZAB have equivalents.

AppendEntries — replicate log entries (and act as a heartbeat)

Sent by the leader to followers to copy entries. When sent with an empty entries[] list, it doubles as a heartbeat ("I'm still alive, don't start an election").

FieldMeaning
termThe leader's current term. Lets a follower reject a stale leader.
leaderIdWho the leader is, so followers can redirect clients to it.
prevLogIndexThe index of the entry immediately before the new ones — the "glue point."
prevLogTermThe term of the entry at prevLogIndex. Together with prevLogIndex this is the consistency check.
entries[]The new log entries to store (empty for a heartbeat).
leaderCommitThe leader's commitIndex, so the follower can learn which entries are now committed and may be applied.

RequestVote — ask other servers to elect you leader

Sent by a candidate (a server trying to become leader) during an election.

FieldMeaning
termThe candidate's new term (it bumped the number to start this election).
candidateIdWho is asking for the vote.
lastLogIndexIndex of the candidate's last log entry — used to prove its log is "up-to-date."
lastLogTermTerm of the candidate's last log entry — the primary measure of "up-to-date."

(The leader also keeps two per-follower bookkeeping arrays: nextIndex[] — the next index it will try to send each follower — and matchIndex[] — the highest index it knows is safely replicated on each follower. These drive the repair process below.)

2.9 The Log Matching Property and repairing a diverging follower

Followers can fall behind or even hold wrong entries — for example, a server that briefly followed a leader who crashed before its entries got committed. We need a way to force every follower's log back into agreement with the leader's, safely. Raft guarantees this with the Log Matching Property:

Key takeaway (Log Matching Property): If two logs contain an entry at the same index with the same term, then (a) those two entries store the identical command, AND (b) all entries before them are identical too. So matching just one (index, term) pair proves the entire history up to that point matches. This is what makes repair efficient: find one matching point and everything before it is already correct.

How the consistency check enforces it: every AppendEntries says "before these new entries, you should have an entry at prevLogIndex with term prevLogTerm." The follower checks: do I actually have that exact entry? If yes, the leader's history matches mine up to that point — accept the new entries. If no, reject the RPC. On rejection the leader decrements nextIndex for that follower and retries with an earlier prevLogIndex, walking backwards one step at a time until it finds the last point where the two logs agree. From there it ships everything forward, and the follower overwrites any conflicting entries with the leader's.

Here is the repair, drawn. Server S3 has a diverged tail (a leftover term 2 entry at index 3 that never committed). The leader (term 3) walks nextIndex back to the agreement point and overwrites:

  BEFORE REPAIR
  index:        1        2        3        4        5
  LEADER  S1: [T1 a ][T1 b ][T3 c ][T3 d ][T3 e ]   ← source of truth
  follower S2:[T1 a ][T1 b ][T3 c ][T3 d ][T3 e ]   ✓ already matches
  follower S3:[T1 a ][T1 b ][T2 x ]                 ✗ index 3 term differs
  follower S4:[T1 a ][T1 b ]                         (just behind, no conflict)

  Leader sends AppendEntries to S3 with prevLogIndex=4 ... reject (S3 has no #4)
                          ... prevLogIndex=3, prevLogTerm=3 ... reject (S3 has T2 not T3)
                          ... prevLogIndex=2, prevLogTerm=1 ... MATCH! (both [T1 b])
  → agreement point found at index 2. Leader ships indexes 3,4,5.
    S3 OVERWRITES its bad [T2 x] at index 3.

  AFTER REPAIR
  index:        1        2        3        4        5
  LEADER  S1: [T1 a ][T1 b ][T3 c ][T3 d ][T3 e ]
  follower S2:[T1 a ][T1 b ][T3 c ][T3 d ][T3 e ]   ✓
  follower S3:[T1 a ][T1 b ][T3 c ][T3 d ][T3 e ]   ✓ repaired (T2 x discarded)
  follower S4:[T1 a ][T1 b ][T3 c ][T3 d ][T3 e ]   ✓ caught up
            └──────── all logs now IDENTICAL ────────┘
       → apply in order → all state machines IDENTICAL
Example: Walk S3 above. Its bad entry [T2 x] at index 3 came from a previous leader (term 2) that died before that entry reached a majority — so it was never committed, and discarding it loses nothing anyone was promised. The leader's backward walk stops at index 2, where both logs hold [T1 b]. By the Log Matching Property, everything at index 1–2 is therefore already identical, so the leader only needs to resend from index 3. S3 overwrites [T2 x] with [T3 c] and copies [T3 d][T3 e]. Now all five logs are byte-for-byte equal — and because each server applies them in index order to a deterministic state machine, all five KV stores end up holding exactly the same data.
Analogy: Two people are syncing the same novel manuscript. Instead of comparing every page, they flip backwards together until they hit a page that is word-for-word identical in both copies ("ah, page 84 matches"). Everything up to page 84 is therefore already in sync, so they only need to recopy from page 85 onward — and the person who is behind tears out their later, conflicting pages and replaces them with the authoritative version. That "flip back to the last matching page, then overwrite forward" is exactly nextIndex repair.

2.10 The two jobs of every consensus protocol

Everything above collapses into two jobs that every leader-based consensus protocol must do. The rest of this study guide — all of Raft, all of Multi-Paxos, all of ZAB — is just two different ways of doing these same two jobs:

  1. Agree on a leader (leader election). Pick exactly one server to sequence the log, and make sure stale leaders are harmlessly evicted. The term/epoch/ballot number plus majority voting are the tools that make "at most one leader per term" true. (This is what RequestVote and terms are for.)
  2. Replicate the log safely (log replication). Get every committed entry onto a majority, in the same index order, and repair any follower that diverges — so that "committed" entries are permanent and identical everywhere. (This is what AppendEntries, the consistency check, commitIndex, and nextIndex repair are for.)
   ┌─────────────────────────────────────────────────────────┐
   │                  CONSENSUS PROTOCOL                       │
   │                                                           │
   │   JOB 1: ELECT A LEADER       JOB 2: REPLICATE THE LOG    │
   │   ─────────────────────       ─────────────────────────   │
   │   • bump term/epoch           • leader appends entry      │
   │   • RequestVote                • AppendEntries to followers │
   │   • need majority of votes    • majority stored → COMMIT  │
   │   • one vote per term         • consistency check + repair│
   │   • higher term wins          • apply in index order      │
   │            │                              │                │
   │            └──────────────┬───────────────┘                │
   │                           ▼                                │
   │        identical ordered logs on every replica            │
   │                           ▼                                │
   │   identical deterministic state machines = same data      │
   └─────────────────────────────────────────────────────────┘

Common mistakes & misconceptions

Common mistake: Believing replicas keep in sync by copying each other's data. They don't. They copy the ordered log of commands and each replays it independently. Identical data is a consequence of identical logs plus deterministic state machines — never something that's directly shipped around.
Common mistake: Letting non-determinism leak into the state machine. If a command does anything that varies between servers — reads the local clock, generates a random id, calls an external API, depends on map-iteration order — the replicas diverge even with identical logs. Any such value must be decided once by the leader and written into the log entry (e.g. log "set createdAt = 2026-06-21T10:00:00Z", not "set createdAt = now()"), so every replica applies the same fixed value.
Common mistake: Confusing "committed" with "applied." Committed = a majority stored it, so it is final and ordered. Applied = a server has run it against its data. Commit happens first and is the agreement; apply is the later side effect. Only committed entries may be applied, and lastApplied never overtakes commitIndex.
Common mistake: Thinking a returning old leader can corrupt data. It can't, thanks to terms. Its messages carry a lower term, so healthy servers reject them, and the moment it hears a higher term in any reply it steps down. The monotonic term number — not timing or luck — is what makes stale leaders harmless.
Common mistake: Assuming consensus alone gives exactly-once execution. Consensus makes the log identical everywhere; it does not stop a retried client request from being committed twice. You need client sessions with request ids (idempotency keys) and a state machine that skips already-applied serial numbers.
Common mistake: Treating the whole log as immutable. Committed history is immutable, yes — but a follower's uncommitted tail can be legitimately overwritten by the leader during repair. Those discarded entries were never committed, so no client was ever told they succeeded, and nothing is lost.

Best practices

Best practice: Keep your state machine strictly deterministic. Forbid clock reads, randomness, and external calls inside command application. Resolve every such value at the leader and bake it into the log entry, so all replicas apply identical inputs.
Best practice: Always give clients sessions with monotonic request ids, and have the state machine record "highest serial applied + result returned" per client. This is the cheap, standard way to turn unavoidable retries into exactly-once effects and avoid silent double-applies.
Best practice: Use an odd cluster size (3, 5, 7). Majority for 3 is 2, for 5 is 3 — an odd count maximizes the failures you tolerate per node you pay for (5 nodes tolerate 2 failures; 6 nodes also only tolerate 2). It also avoids tie votes.
Best practice: Never let unbounded logs grow forever. Pair the log with snapshots (a compact dump of the state machine at some index) so old, already-applied entries can be discarded and a lagging or new follower can be caught up by snapshot + tail instead of replaying from index 1.
Best practice: Route all writes through the leader and accept the redirect cost. Resist "let any node accept writes" shortcuts — single-leader sequencing is what makes the log order unambiguous and the system easy to reason about, which is why etcd, ZooKeeper, Consul, and Chubby all do it.

Section summary

  • A Replicated State Machine (RSM): identical deterministic state machines that start equal and apply the same commands in the same order stay equal forever — so consensus reduces to "agree on the ordered log."
  • The log is an append-only list of entries; each entry has an index (order), a command, and a term/epoch/ballot (which leadership period created it).
  • Committed (a majority has stored it → final and permanent) is distinct from applied (run against the local state machine); commitIndex tracks the first, lastApplied the second, and apply only ever follows commit.
  • Term/epoch/ballot is a monotonic logical clock for leadership: "a new election = a new term," higher term wins, lower term is rejected, one vote per term → at most one leader per term, and stale leaders are evicted harmlessly.
  • Almost all production systems are leader-based (Raft, Multi-Paxos, ZAB/ZooKeeper, etcd, Consul, Chubby) because one leader makes log order trivial and the common case fast; leaderless designs avoid the bottleneck but risk dueling proposers.
  • Clients send commands to the leader, which appends → replicates (AppendEntries) → commits on majority → applies → replies; followers learn the commit point from leaderCommit.
  • Idempotency via request ids / client sessions is required to stop retried commands from being applied twice — consensus alone does not give exactly-once execution.
  • The Log Matching Property (same index + same term ⇒ identical entry and identical history before it) plus the prevLogIndex/prevLogTerm consistency check let a leader repair a diverging follower by walking nextIndex back to the agreement point and overwriting its uncommitted tail.
  • Every consensus protocol has exactly two jobs: (1) agree on a leader and (2) replicate the log safely — Raft and Paxos are just two ways of doing these same two things.

Continue reading