Raft — Log Replication, Safety & Membership
In the previous section you saw how Raft picks a single leader (the one server allowed to accept new commands) and how a term (a numbered period of time, each with at most one leader) bounds the chaos. This section covers the harder half of Raft: once you have a leader, how do you copy the stream of commands to every server so that all of them end up identical — and stay identical even when servers crash, fall behind, or you add and remove machines? That stream of commands is called the replicated log, and keeping it consistent is the whole job.
Let me define the core vocabulary up front, in plain words, so the rest reads smoothly.
- Log — an ordered list of entries. Each server keeps its own copy. The goal is to make every copy identical.
- Log entry — one slot in the log. It holds three things: a command (the actual operation to run, e.g. "set x = 5"), the term in which the leader created it, and its position number, the index (1, 2, 3, …).
- State machine — the thing the log feeds. It is just your application's data plus the rule "apply commands in log order." If every server applies the same commands in the same order, every server reaches the same final state. That is the entire point of consensus.
- Apply / commit — an entry is committed once Raft guarantees it will never be lost; only then is it safe to feed ("apply") it to the state machine and reply to the client.
- commitIndex — the highest log index a server knows is committed.
- Majority — more than half the servers. In a 5-server cluster a majority is 3. Any two majorities always share at least one server — that overlap is the mathematical trick everything below rests on.
A replicated log (5-server cluster). Each box = one entry: term^index / command
idx: 1 2 3 4 5
Leader [ 1|x=3 ][ 1|y=1 ][ 2|x=9 ][ 3|z=7 ][ 3|x=0 ]
Follower A [ 1|x=3 ][ 1|y=1 ][ 2|x=9 ][ 3|z=7 ][ 3|x=0 ] (in sync)
Follower B [ 1|x=3 ][ 1|y=1 ][ 2|x=9 ] (lagging by 2)
Follower C [ 1|x=3 ][ 1|y=1 ][ 2|x=9 ][ 3|z=7 ][ 3|x=0 ]
Follower D [ 1|x=3 ][ 1|y=1 ] (lagging by 3)
"1|x=3" means: created in TERM 1, command is x=3, sitting at INDEX 1.
The number before the bar is the TERM, not the value.
4.1 The AppendEntries RPC in detail
An RPC ("remote procedure call") is just a message one server sends another that says "run this and tell me the result." Raft uses only two RPCs. You met RequestVote in the election section. The workhorse here is AppendEntries. The leader sends it to every follower both to copy new entries and, when there are no new entries, as a heartbeat (an empty AppendEntries that says "I'm still leader, stay quiet").
Here are the fields the leader puts in each AppendEntries request:
| Field | Plain meaning |
|---|---|
term | The leader's current term. Lets the follower detect a stale (old) leader. |
leaderId | Who the leader is, so a follower can redirect clients to it. |
prevLogIndex | The index of the entry immediately before the new ones. This is the "join point." |
prevLogTerm | The term of that entry at prevLogIndex. Together with prevLogIndex it is a fingerprint of the follower's log up to that point. |
entries[] | The new entries to store (empty for a heartbeat). |
leaderCommit | The leader's commitIndex — tells the follower how far it is safe to apply. |
The reply is tiny: the follower's own term (so the leader can step down if it's behind) and a boolean success.
When a follower receives AppendEntries, it runs these steps in order:
- Term check. If the request's
termis older than the follower's own term, replysuccess = false— this is a stale leader; ignore it. Otherwise accept this leader and reset the election timeout (so it won't start its own election). - Consistency check. Look at the follower's own entry at
prevLogIndex. If the follower has no entry there, or it has one but its term ≠prevLogTerm, replysuccess = false. The logs disagree at the join point, so appending now would create a hole or a mismatch. - Repair conflicts. If an existing entry conflicts with a new one (same index, different term), delete that entry and everything after it. The leader's copy always wins.
- Append. Add any new entries not already present.
- Advance commit. If
leaderCommit> the follower'scommitIndex, setcommitIndex = min(leaderCommit, index of last new entry), then apply newly-committed entries to the state machine. - Reply
success = true.
The Log Matching Property — why the consistency check is enough
Step 2 looks almost too simple — it only checks one entry, the one at prevLogIndex. How can checking a single entry guarantee the whole log up to that point matches? Because of an invariant Raft maintains called the Log Matching Property:
If two logs contain an entry with the same index and the same term, then (a) they store the same command at that index, and (b) the logs are identical in every entry before it.
Part (a) holds because a leader creates at most one entry per index per term, and entries never move once created. Part (b) holds by induction, and the AppendEntries consistency check is exactly what enforces it: a follower only accepts new entries if its entry at prevLogIndex already matches the leader's term there. So matching at the join point implies matching all the way back. This is the load-bearing property of the whole protocol — it turns "are these two long logs identical?" into a single cheap comparison.
prevLogIndex + prevLogTerm act as a checksum of everything before the new entries. If they match, the entire prefix matches — that is the Log Matching Property, and it lets Raft verify and repair logs one entry at a time instead of comparing the whole thing.
4.2 Consistency check & repair — fixing a divergent follower
After a crash, a follower's log can be wrong in two ways: it can be missing entries the leader has, or it can have extra entries (from a previous leader) that were never committed and must be thrown away. Raft fixes both with one mechanism.
The leader keeps, for each follower, a number called nextIndex: "the next log index I will try to send you." When a leader is first elected, it optimistically sets every follower's nextIndex to its own last index + 1. It then sends AppendEntries with prevLogIndex = nextIndex − 1. If the follower rejects (the consistency check fails), the leader decrements nextIndex and retries. It keeps backing up until it finds the last index where the two logs agree. From that point forward, its entries flow in, and the follower's step 3 deletes any conflicting tail. Eventually the follower's log is byte-for-byte the leader's.
LOG REPAIR: leader backs up nextIndex until logs agree, then overwrites the tail.
idx: 1 2 3 4 5 6 7
Leader (term 4) [ 1 ][ 1 ][ 1 ][ 4 ][ 4 ][ 4 ][ 4 ] (numbers = TERM of entry)
Follower X [ 1 ][ 1 ][ 1 ][ 2 ][ 2 ][ 3 ]
^^^^^^^^^^^^^^
extra, UNCOMMITTED entries from old terms 2 & 3
Leader starts: nextIndex(X) = 7 -> AppendEntries prevLogIndex=6,prevLogTerm=4
Follower X@6 has term 3, not 4 -> success=false. Leader: nextIndex 7->6
prevLogIndex=5,prevLogTerm=4 : X@5 has term 2 -> false. nextIndex 6->5
prevLogIndex=4,prevLogTerm=4 : X@4 has term 2 -> false. nextIndex 5->4
prevLogIndex=3,prevLogTerm=1 : X@3 has term 1 -> MATCH! success=true.
Leader now sends entries[4..7] = (4,4,4,4).
Follower deletes its old 4,5,6 (terms 2,2,3) and writes the leader's.
Result -> Follower X [ 1 ][ 1 ][ 1 ][ 4 ][ 4 ][ 4 ][ 4 ] identical.
AppendEntries (prevLogIndex=4) is rejected because D has nothing at index 4. The leader backs up to prevLogIndex=2, which D does have and which matches, so it ships entries 3, 4, 5. No deletion needed — D was simply behind. The "extra entries" case (Follower X above) is the interesting one, because there Raft must discard committed-looking-but-actually-uncommitted work. That's safe precisely because those entries were never committed (never on a majority), so no client was ever told they succeeded.
nextIndex by one per round-trip is correct but slow when a follower is far behind (one RPC per missing entry). Production systems use the paper's optimization: the rejected follower returns the term of the conflicting entry and the first index it has for that term, letting the leader skip a whole term's worth of indexes in one jump. etcd and others implement this. Don't ship the naive one-at-a-time version under load.
4.3 Commitment — the majority rule and the Figure-8 subtlety
An entry is committed the moment it is safely stored on a majority of servers. Once committed, it is durable: even if any minority of servers crash, every future leader will still contain it (you'll see why in §4.4). After committing, the leader advances its commitIndex, applies the entry, replies to the client, and piggybacks the new commitIndex on the next AppendEntries via leaderCommit so followers apply it too.
COMMIT BY MAJORITY (5 servers, majority = 3). Leader appends entry @ index 5.
t0 Client --"set x=0"--> Leader
t1 Leader appends [3|x=0]@5 to its OWN log (stored on 1 of 5)
t2 Leader --AppendEntries(entries=[3|x=0])--> A,B,C,D
A: stored, ack --> 2 of 5
B: stored, ack --> 3 of 5 *** MAJORITY reached ***
t3 Leader sees 3/5 -> commitIndex = 5 -> APPLY x=0 -> reply "OK" to client
t4 (C and D ack later; they catch up via leaderCommit on next heartbeat)
The entry is COMMITTED at t2 even though C and D haven't stored it yet.
Durability comes from the majority overlap, not from "everyone".
Why a leader must not commit an old-term entry just by counting replicas (the Figure-8 problem)
Here is the famous subtlety. Suppose a new leader finds an entry from an earlier term sitting in its log, replicated on a majority. It is tempting to say "it's on a majority, therefore it's committed." This is wrong, and it can lose committed data. The danger: an entry can be on a majority of servers at one moment, then a different server with a shorter log can still win a future election and overwrite that entry — because elections compare logs by term first, not by length (see §4.4). So "on a majority right now" does not by itself mean "permanent."
Raft's fix is a rule called the commitment restriction:
A leader only counts replicas to commit entries from its own current term. Entries from earlier terms are committed only indirectly — once an entry from the current term that sits after them gets committed, the Log Matching Property drags the older entries along with it.
So a fresh leader does not "re-commit" what it inherited. It commits something new in its own term; the moment that new entry is on a majority and committed, everything before it is committed too, safely.
E was created by leader S1 in term 2 and copied to a majority — but not yet committed. S1 crashes. S5 (which never had E) wins term 3 because its log isn't behind on term. S5 crashes; S1 comes back as leader in term 4 and finishes spreading E to all. If S1 were allowed to call E "committed" just because it's now on a majority, disaster looms: S5 could come back, win again, and overwrite E — erasing an entry we'd called committed. Raft forbids this by never committing E (a term-2 entry) via replica-counting in term 4. E only becomes committed when S1 commits a fresh term-4 entry after it; at that point S5 can no longer win, so E is truly safe.
4.4 Safety — how Raft guarantees committed entries are never lost
Everything above only works if Raft can promise one thing: a committed entry is never lost, never overwritten, and never appears at a different index later. That promise is the State Machine Safety Property: if any server has applied an entry at a given index, no other server will ever apply a different entry at that index. Raft reaches it through two smaller guarantees.
(1) The Election Restriction — only an up-to-date candidate can win
When a candidate asks for a vote via RequestVote, it includes its lastLogIndex and lastLogTerm. A follower grants its vote only if the candidate's log is at least as up-to-date as its own. "Up-to-date" is compared like this:
- Compare the last log term of both. The higher term wins — it is more up-to-date.
- If the last terms are equal, the longer log (higher last index) wins.
Because a candidate needs a majority of votes, and any majority overlaps the majority that stored a committed entry, at least one voter has that committed entry. That voter will refuse to vote for any candidate whose log is missing it (such a candidate would be "less up-to-date"). Therefore a candidate that lacks a committed entry can never assemble a majority. It cannot win.
(2) Leader Completeness → State Machine Safety
The Election Restriction gives the Leader Completeness Property: every entry committed in some term is present in the logs of all leaders of higher terms. In plain words: once an entry is committed, every future leader already has it. Combine that with the rule that a leader never deletes or overwrites its own entries (it only appends), and you get the conclusion: a committed entry sits at the same index in every leader from now on, gets copied to all followers, and is applied identically everywhere. That is exactly State Machine Safety.
4.5 Cluster membership changes
Real clusters grow, shrink, and replace dead machines. The configuration is the set of servers that count toward a majority. Changing it is dangerous because for a brief moment the cluster might disagree about who the members are, and that can produce two disjoint majorities electing two leaders at once — a split brain, the exact thing consensus exists to prevent.
WHY A NAIVE SWITCH IS UNSAFE: growing from 3 servers to 5 in one step.
Old config C-old = {S1,S2,S3} majority = 2
New config C-new = {S1,S2,S3,S4,S5} majority = 3
If servers adopt the new config at different times, you can get:
S1,S2 still think config is C-old -> {S1,S2} = majority of C-old
S3,S4,S5 think config is C-new -> {S3,S4,S5} = majority of C-new
TWO majorities, no overlap -> S1 and S3 can BOTH win elections
-> TWO leaders in the same term. Split brain.
Joint consensus (C-old,new) — the general, safe method
Raft's original answer is a two-phase switch through a transitional combined configuration written C-old,new. The trick: while in this transitional state, a decision (winning an election, committing an entry) requires separate majorities from both the old set and the new set — not one merged majority. This makes overlapping majorities impossible, so no two leaders can co-exist. The steps:
- The leader appends a special configuration entry
C-old,newto its log and replicates it. The instant a server stores this entry (even before it's committed) it starts using it for all future decisions. - While
C-old,newis in force, every election and every commit needs a majority of C-old and a majority of C-new. Neither old nor new can act alone. - Once
C-old,newis committed, the leader appends a finalC-newentry. After that commits, the transition is done; servers not in C-new shut down.
The overlap at each hop (old↔joint, joint↔new) guarantees a single decision-maker throughout. Clients keep being served during the change — there is no downtime.
Single-server-at-a-time — the simpler method
Joint consensus is fiddly to implement, so Ongaro's dissertation describes a simpler approach that most systems use: add or remove only one server per configuration change. Adding or removing a single server cannot create two disjoint majorities — the old and new majorities are guaranteed to overlap (the math: changing the count by one can't split the cluster into two non-overlapping majority halves). To go from 3 to 5, you do two single-server changes in sequence. This is what etcd and Consul use in practice.
AppendEntries without counting toward any majority. Only promote it to a full voter once its log is current. Otherwise a slow, empty new member can stall commits (the leader can't reach a majority that includes a member with nothing) or even trigger spurious leader changes. etcd's "learner" and Consul's non-voting members exist for exactly this.
4.6 Log compaction & snapshotting
The log grows forever — every command is a new entry. Left unchecked it would fill the disk and make restarts (which replay the whole log) unbearably slow. The fix is a snapshot: a compact dump of the state machine's current state at a point in time. Once you have a snapshot up to index N, you can throw away log entries 1…N, because their combined effect is already captured in the snapshot.
Each server snapshots independently. A snapshot records the state plus two pieces of bookkeeping: lastIncludedIndex (the last log index it covers) and lastIncludedTerm (that entry's term) — kept so the Log Matching consistency check still works at the boundary after the entries are gone.
This creates a corner case: what if a follower is so far behind that the leader has already discarded the entries that follower still needs? The leader can't send entries it no longer has. So Raft adds a third RPC, InstallSnapshot: the leader ships its whole snapshot to the lagging follower, who installs it wholesale (discarding its own now-superseded log) and resumes normal AppendEntries from the snapshot's end. Its fields: term, leaderId, lastIncludedIndex, lastIncludedTerm, offset + data (the snapshot bytes, sent in chunks for large states), and done (last chunk flag).
InstallSnapshot with lastIncludedIndex = 950,000; the follower replaces its stale log with that snapshot in one shot, then catches up the last few thousand entries normally — far faster than replaying a million.
Common mistakes & misconceptions
prevLogIndex. The Log Matching Property is what makes that one check imply the entire prefix matches.
Best practices
nextIndex by one per RPC. It turns slow per-entry repair into per-term jumps.
lastIncludedIndex/lastIncludedTerm so the boundary consistency check still works, and stream large snapshots in chunks.
currentTerm, the vote, and the log to stable storage before responding to any RPC. Acknowledging an entry you haven't durably stored breaks the majority-durability guarantee on a crash.
Section summary
AppendEntriescarriesprevLogIndex/prevLogTermas a fingerprint; a follower accepts new entries only if that fingerprint matches, which enforces the Log Matching Property (same index+term ⇒ identical logs up to there).- A divergent follower is repaired by the leader backing up
nextIndexuntil the logs agree, then overwriting the follower's conflicting tail with its own entries. - An entry is committed once stored on a majority; the leader then advances
commitIndexand tells followers vialeaderCommit. - A leader commits by replica-count only for entries in its own term; older entries commit indirectly once a current-term entry on top of them commits — this is the Figure-8 fix.
- The Election Restriction (vote only for an at-least-as-up-to-date log, compared by last term then last index) plus majority overlap yields Leader Completeness and therefore State Machine Safety: committed entries are never lost or reordered.
- Naive membership changes can create two majorities and split brain; Raft prevents this with joint consensus (C-old,new) or the simpler single-server-at-a-time method.
- Logs can't grow forever; snapshots compact committed state, and
InstallSnapshotbrings a far-behind follower up to date when needed entries have been discarded. - Across the protocol, "committed" is an absolute promise — once an entry is committed, every future leader has it and every server eventually applies it at the same index.