Replicated State Machines & the Log
This section gives you the one mental model that every practical consensus system — Raft, Paxos, ZooKeeper's ZAB, etcd, Consul, Google's Chubby and Spanner — is built on. If you understand this section deeply, the later sections on Raft and Paxos become "two different ways to do the same thing." Almost everything those protocols do is in service of the idea you are about to learn.
The idea has a name: the Replicated State Machine, usually shortened to RSM. Let's build it up slowly.
2.1 What is a "state machine"? (the plain version)
A state machine is just a fancy name for a thing that has some current state (some data) and a fixed set of commands that change that state. You feed it a command, it updates its state, and it may give back a result. That's it.
The crucial property we care about is this: a state machine is deterministic. Deterministic means "no randomness, no surprises" — given the same starting state and the same command, it always produces the exact same new state and the exact same result. Every single time. No reading the clock, no random numbers, no peeking at the network — just (old state + command) → (new state + result).
{ "x": 1, "y": 5 }. Its commands are things like SET x 10, SET y 7, DELETE x, INCR y. Start it empty, apply SET x 1 then SET x 2 then INCR x, and you will always end with { "x": 3 }. The result never depends on luck — only on the commands and the order they arrive in.2.2 The core insight: same start + same commands + same order = same end
Now copy that KV store onto five different servers. Call each copy a replica (you already know this word from the foundations guide — it's an independent server holding its own full copy of the data). We want all five replicas to hold identical data, forever, even though servers crash, messages get lost, and there is no global clock.
Here is the whole trick, and it is almost embarrassingly simple:
Stare at that for a second, because it changes the whole problem. We started with a hard-sounding goal — "keep five copies of a database in sync across an unreliable network" — and we just reduced it to a different goal: "make all five servers agree on one single, ordered list of commands." If they agree on the list, identical data falls out for free.
That ordered list of commands has a name: the log. And "make everyone agree on the ordered log" is exactly what a consensus algorithm does. So:
THE BIG REDUCTION
"Keep N database replicas identical" (hard, vague)
│
│ because every replica is a
│ DETERMINISTIC state machine
▼
"Make all N replicas agree on ONE ordered (precise!)
list of commands — the LOG"
│
│ this exact thing is...
▼
C O N S E N S U S
2.3 The replicated log, drawn as boxes
The log is an append-only list of entries. "Append-only" means you only ever add to the end — you never insert in the middle and never edit a committed entry. (Followers can have wrong, uncommitted entries overwritten, as we'll see — but agreed-upon history is immutable.)
Each log entry carries three things:
- An index — the entry's position in the list: 1, 2, 3, 4, … Think of it as the line number. It defines the order.
- A command — the actual instruction for the state machine, e.g.
SET x 3. - A term (Raft's word) / epoch (ZooKeeper's word) / ballot (Paxos's word) — a number telling you "which leadership period created this entry." We'll explain terms fully in §2.5; for now just know each entry is stamped with one.
Here is one server's log drawn as boxes. Top number = term, bottom = command. The index runs left to right:
index: 1 2 3 4 5
┌────────┬────────┬────────┬────────┬────────┐
│ term 1 │ term 1 │ term 2 │ term 3 │ term 3 │
LOG → │ SET x 1│ SET y 5│ INCR x │ DEL y │ SET z 9│
└────────┴────────┴────────┴────────┴────────┘
▲
commitIndex = 4
(entries 1–4 are COMMITTED & safe;
entry 5 is only proposed, not yet safe)
2.4 "Committed" vs "applied" — two different milestones
This distinction trips people up constantly, so we'll be precise. An entry passes through milestones on its way to being real:
- Appended — the entry exists in some server's log, but might still be lost or overwritten. Not safe yet.
- Committed — the entry has been safely replicated (stored) on a majority of servers. "Majority" means more than half — for 5 servers, at least 3. Once committed, the entry is permanent: it can never be lost or changed, even if servers crash, because any future majority must overlap with this majority on at least one server that has the entry. Committed = "this decision is final."
- Applied — the command has actually been run against the local state machine, changing the data (the KV map). Applied = "this decision has taken effect on my copy."
Two pieces of per-server state track these milestones (the real field names from Raft):
commitIndex— the highest index known to be committed.lastApplied— the highest index this server has actually applied to its state machine.
The rule every server runs in a loop is simply: whenever commitIndex > lastApplied, increment lastApplied and apply that next log entry to the state machine. So "applied" always chases "committed." Commit is the agreement; apply is the side effect.
leaderCommit from the leader) but not yet have applied it. Commit decides "this is final and ordered"; apply decides "my data now reflects it." Only committed entries may ever be applied — never an uncommitted one, because an uncommitted entry could still be thrown away.commitIndex tracks the first; lastApplied tracks the second; lastApplied never gets ahead of commitIndex.2.5 Term / epoch / ballot numbers — a logical clock for leadership
Recall from the foundations guide: there is no global clock, so we can't order events by wall-clock time. We need a logical way to say "this happened in a later leadership period than that." That's what the term number is.
A term (Raft) — equivalently an epoch (ZAB) or ballot number (Paxos) — is a monotonically increasing integer that labels each period of leadership. "Monotonically increasing" just means it only ever goes up, never down: 1, 2, 3, 4, … Every time the cluster holds a new election to pick a leader, it bumps the number. A new election = a new term.
TIME (logical, not wall-clock) ───────────────────────────────►
┌───── term 1 ─────┐┌─ term 2 ─┐┌──────── term 3 ────────┐
│ leader = S1 ││ election ││ leader = S4 │
│ (normal work) ││ (no ldr) ││ (normal work) │
└──────────────────┘└──────────┘└────────────────────────┘
term goes 1 ──────────► 2 ──────────► 3 (only ever UP)
Some terms have a leader; some terms are just a failed election
(split vote) with no leader — then the cluster moves to the next term.
Why does a single ever-increasing number matter so much? Because it lets every server settle disputes locally with one comparison, and it cleanly evicts stale leaders. Two rules do almost all the work:
- Higher term always wins. Every message (every RPC — "Remote Procedure Call," a function call sent over the network to another server) carries its sender's term. When a server receives a message with a term higher than its own
currentTerm, it adopts that higher term and immediately steps down to being a plain follower. When a server receives a message with a term lower than its own, it rejects it — "you're out of date, ignore me." - One vote per term. A server stores who it voted for this term in a field called
votedFor, so it can never vote twice in the same term. Combined with "majority needed to win," this guarantees at most one leader per term — two candidates can't both collect a majority from the same set of voters in the same term.
This is precisely how the system avoids the nightmare of two leaders clashing. Suppose an old leader from term 2 was network-partitioned (cut off) for a while, and meanwhile the rest elected a new leader in term 3. When the partition heals and the old term-2 leader tries to push entries, every healthy server sees "term 2 < my term 3" and rejects it. The old leader, on hearing a reply carrying term 3, sees the higher number and steps down. No corruption — the higher term simply wins.
2.6 Leader-based vs leaderless — why nearly everyone elects a leader
There are two broad ways to agree on the log:
| Approach | How it works | Trade-off | Real systems |
|---|---|---|---|
| Leader-based | The cluster elects one server as leader. The leader alone decides the order of entries and pushes them to followers. All writes funnel through the leader. | Simple to reason about; one server sequences everything, so the order is obvious. Downside: the leader is a bottleneck and a single point of (re-elected) failure — if it dies you pause for an election. | Raft, Multi-Paxos (with a stable leader), ZooKeeper/ZAB, etcd, Consul, Chubby |
| Leaderless | No fixed leader; any server can propose, and servers run a voting round (or several) per individual log slot to agree on what goes there. | No single bottleneck and no election pause. Downside: dueling proposers — two servers proposing for the same slot can keep interrupting each other, causing extra rounds and stalls (livelock). | Basic (single-decree) Paxos; Egalitarian Paxos (EPaxos) |
Why do almost all production systems pick a leader? Because once you have a single leader, the order of the log is trivially decided — it's just "whatever order the leader received the requests in." There is no argument about sequencing, because one server is doing all the sequencing. This makes the system far easier to understand, debug, and reason about, and it makes the common case fast: a stable leader can skip the expensive "who's in charge?" negotiation and just stream entries. (This is exactly the optimization Multi-Paxos makes over basic Paxos — run the leader-election "prepare" phase once, then reuse that leadership to commit many log entries with a single round-trip each. Raft bakes a strong leader in from the start.)
2.7 How a client actually talks to the system
Now put yourself in the shoes of a client — your web application that wants to write data. The normal happy path in a leader-based system is:
CLIENT LEADER (S1) FOLLOWERS (S2, S3, S4, S5)
│ │ │ │ │ │
│ SET x 3 │ │ │ │ │
├────────────────►│ 1. append to own log │ │ │ │
│ │ (index 5, term 3) │ │ │ │
│ │ │ │ │ │
│ │ 2. AppendEntries RPC │ │ │ │
│ ├─────────────────────────►│ │ │ │ (replicate)
│ ├─────────────────────────────►│ │ │
│ ├─────────────────────────────────►│ │
│ │◄───── ACK ───────────────┤ │ │ │
│ │◄───── ACK ───────────────────┤ │ │
│ │ │ │ │ │
│ │ 3. majority (3/5) stored │ │ │ │
│ │ → COMMIT index 5 │ │ │ │
│ │ 4. apply to state machine│ │ │ │
│ result: OK │ │ │ │ │
│◄────────────────┤ 5. reply to client │ │ │ │
│ │ │ │ │ │
- The client sends its command to the leader. (If it accidentally contacts a follower, the follower redirects it to the leader.)
- The leader appends the command to its own log as a new entry (next index, current term).
- The leader replicates the entry to followers via the
AppendEntriesRPC (covered in §2.8). - Once a majority have stored it, the leader marks it committed and applies it to its own state machine.
- The leader replies to the client with the result. Followers learn the entry is committed from the leader's next message and apply it too.
Why clients need idempotency / de-duplication
Here's a subtle but critical problem. Suppose the leader commits your INCR x, applies it, and then crashes before the reply reaches you. Your client times out and retries. A new leader is elected — and now your INCR x risks being applied twice, turning a +1 into a +2. On a replicated state machine, a double-apply corrupts the data on every replica identically, so you can't even detect it by comparing replicas.
The fix is idempotency via de-duplication. Idempotent means "applying it twice has the same effect as applying it once." The standard technique:
- The client opens a session and tags every command with a unique, increasing serial number (a request id) — e.g.
{ requestId: 1742, cmd: "INCR x" }. - The state machine remembers, per client, the highest serial number it has already applied and the result it returned.
- If a command arrives whose serial number was already applied, the system does not re-run it — it just returns the remembered result.
This turns "at-least-once delivery" (the network may deliver your retry more than once) into "exactly-once effect" (the command changes state only once). This is what people loosely call "exactly-once" — it's really at-least-once delivery + idempotent processing.
2.8 The two RPCs that move the log (Raft's concrete fields)
To make this real, here are the exact fields of the two messages a leader-based system uses. We use Raft's names because they're the clearest; Paxos and ZAB have equivalents.
AppendEntries — replicate log entries (and act as a heartbeat)
Sent by the leader to followers to copy entries. When sent with an empty entries[] list, it doubles as a heartbeat ("I'm still alive, don't start an election").
| Field | Meaning |
|---|---|
term | The leader's current term. Lets a follower reject a stale leader. |
leaderId | Who the leader is, so followers can redirect clients to it. |
prevLogIndex | The index of the entry immediately before the new ones — the "glue point." |
prevLogTerm | The term of the entry at prevLogIndex. Together with prevLogIndex this is the consistency check. |
entries[] | The new log entries to store (empty for a heartbeat). |
leaderCommit | The leader's commitIndex, so the follower can learn which entries are now committed and may be applied. |
RequestVote — ask other servers to elect you leader
Sent by a candidate (a server trying to become leader) during an election.
| Field | Meaning |
|---|---|
term | The candidate's new term (it bumped the number to start this election). |
candidateId | Who is asking for the vote. |
lastLogIndex | Index of the candidate's last log entry — used to prove its log is "up-to-date." |
lastLogTerm | Term of the candidate's last log entry — the primary measure of "up-to-date." |
(The leader also keeps two per-follower bookkeeping arrays: nextIndex[] — the next index it will try to send each follower — and matchIndex[] — the highest index it knows is safely replicated on each follower. These drive the repair process below.)
2.9 The Log Matching Property and repairing a diverging follower
Followers can fall behind or even hold wrong entries — for example, a server that briefly followed a leader who crashed before its entries got committed. We need a way to force every follower's log back into agreement with the leader's, safely. Raft guarantees this with the Log Matching Property:
How the consistency check enforces it: every AppendEntries says "before these new entries, you should have an entry at prevLogIndex with term prevLogTerm." The follower checks: do I actually have that exact entry? If yes, the leader's history matches mine up to that point — accept the new entries. If no, reject the RPC. On rejection the leader decrements nextIndex for that follower and retries with an earlier prevLogIndex, walking backwards one step at a time until it finds the last point where the two logs agree. From there it ships everything forward, and the follower overwrites any conflicting entries with the leader's.
Here is the repair, drawn. Server S3 has a diverged tail (a leftover term 2 entry at index 3 that never committed). The leader (term 3) walks nextIndex back to the agreement point and overwrites:
BEFORE REPAIR
index: 1 2 3 4 5
LEADER S1: [T1 a ][T1 b ][T3 c ][T3 d ][T3 e ] ← source of truth
follower S2:[T1 a ][T1 b ][T3 c ][T3 d ][T3 e ] ✓ already matches
follower S3:[T1 a ][T1 b ][T2 x ] ✗ index 3 term differs
follower S4:[T1 a ][T1 b ] (just behind, no conflict)
Leader sends AppendEntries to S3 with prevLogIndex=4 ... reject (S3 has no #4)
... prevLogIndex=3, prevLogTerm=3 ... reject (S3 has T2 not T3)
... prevLogIndex=2, prevLogTerm=1 ... MATCH! (both [T1 b])
→ agreement point found at index 2. Leader ships indexes 3,4,5.
S3 OVERWRITES its bad [T2 x] at index 3.
AFTER REPAIR
index: 1 2 3 4 5
LEADER S1: [T1 a ][T1 b ][T3 c ][T3 d ][T3 e ]
follower S2:[T1 a ][T1 b ][T3 c ][T3 d ][T3 e ] ✓
follower S3:[T1 a ][T1 b ][T3 c ][T3 d ][T3 e ] ✓ repaired (T2 x discarded)
follower S4:[T1 a ][T1 b ][T3 c ][T3 d ][T3 e ] ✓ caught up
└──────── all logs now IDENTICAL ────────┘
→ apply in order → all state machines IDENTICAL
[T2 x] at index 3 came from a previous leader (term 2) that died before that entry reached a majority — so it was never committed, and discarding it loses nothing anyone was promised. The leader's backward walk stops at index 2, where both logs hold [T1 b]. By the Log Matching Property, everything at index 1–2 is therefore already identical, so the leader only needs to resend from index 3. S3 overwrites [T2 x] with [T3 c] and copies [T3 d][T3 e]. Now all five logs are byte-for-byte equal — and because each server applies them in index order to a deterministic state machine, all five KV stores end up holding exactly the same data.nextIndex repair.2.10 The two jobs of every consensus protocol
Everything above collapses into two jobs that every leader-based consensus protocol must do. The rest of this study guide — all of Raft, all of Multi-Paxos, all of ZAB — is just two different ways of doing these same two jobs:
- Agree on a leader (leader election). Pick exactly one server to sequence the log, and make sure stale leaders are harmlessly evicted. The term/epoch/ballot number plus majority voting are the tools that make "at most one leader per term" true. (This is what
RequestVoteand terms are for.) - Replicate the log safely (log replication). Get every committed entry onto a majority, in the same index order, and repair any follower that diverges — so that "committed" entries are permanent and identical everywhere. (This is what
AppendEntries, the consistency check,commitIndex, andnextIndexrepair are for.)
┌─────────────────────────────────────────────────────────┐ │ CONSENSUS PROTOCOL │ │ │ │ JOB 1: ELECT A LEADER JOB 2: REPLICATE THE LOG │ │ ───────────────────── ───────────────────────── │ │ • bump term/epoch • leader appends entry │ │ • RequestVote • AppendEntries to followers │ │ • need majority of votes • majority stored → COMMIT │ │ • one vote per term • consistency check + repair│ │ • higher term wins • apply in index order │ │ │ │ │ │ └──────────────┬───────────────┘ │ │ ▼ │ │ identical ordered logs on every replica │ │ ▼ │ │ identical deterministic state machines = same data │ └─────────────────────────────────────────────────────────┘
Common mistakes & misconceptions
lastApplied never overtakes commitIndex.Best practices
Section summary
- A Replicated State Machine (RSM): identical deterministic state machines that start equal and apply the same commands in the same order stay equal forever — so consensus reduces to "agree on the ordered log."
- The log is an append-only list of entries; each entry has an index (order), a command, and a term/epoch/ballot (which leadership period created it).
- Committed (a majority has stored it → final and permanent) is distinct from applied (run against the local state machine);
commitIndextracks the first,lastAppliedthe second, and apply only ever follows commit. - Term/epoch/ballot is a monotonic logical clock for leadership: "a new election = a new term," higher term wins, lower term is rejected, one vote per term → at most one leader per term, and stale leaders are evicted harmlessly.
- Almost all production systems are leader-based (Raft, Multi-Paxos, ZAB/ZooKeeper, etcd, Consul, Chubby) because one leader makes log order trivial and the common case fast; leaderless designs avoid the bottleneck but risk dueling proposers.
- Clients send commands to the leader, which appends → replicates (
AppendEntries) → commits on majority → applies → replies; followers learn the commit point fromleaderCommit. - Idempotency via request ids / client sessions is required to stop retried commands from being applied twice — consensus alone does not give exactly-once execution.
- The Log Matching Property (same index + same term ⇒ identical entry and identical history before it) plus the
prevLogIndex/prevLogTermconsistency check let a leader repair a diverging follower by walkingnextIndexback to the agreement point and overwriting its uncommitted tail. - Every consensus protocol has exactly two jobs: (1) agree on a leader and (2) replicate the log safely — Raft and Paxos are just two ways of doing these same two things.