Raft — Log Replication, Safety & Membership

By Pritesh Yadav 21 min read

In the previous section you saw how Raft picks a single leader (the one server allowed to accept new commands) and how a term (a numbered period of time, each with at most one leader) bounds the chaos. This section covers the harder half of Raft: once you have a leader, how do you copy the stream of commands to every server so that all of them end up identical — and stay identical even when servers crash, fall behind, or you add and remove machines? That stream of commands is called the replicated log, and keeping it consistent is the whole job.

Let me define the core vocabulary up front, in plain words, so the rest reads smoothly.

  • Log — an ordered list of entries. Each server keeps its own copy. The goal is to make every copy identical.
  • Log entry — one slot in the log. It holds three things: a command (the actual operation to run, e.g. "set x = 5"), the term in which the leader created it, and its position number, the index (1, 2, 3, …).
  • State machine — the thing the log feeds. It is just your application's data plus the rule "apply commands in log order." If every server applies the same commands in the same order, every server reaches the same final state. That is the entire point of consensus.
  • Apply / commit — an entry is committed once Raft guarantees it will never be lost; only then is it safe to feed ("apply") it to the state machine and reply to the client.
  • commitIndex — the highest log index a server knows is committed.
  • Majority — more than half the servers. In a 5-server cluster a majority is 3. Any two majorities always share at least one server — that overlap is the mathematical trick everything below rests on.
A replicated log (5-server cluster). Each box = one entry: term^index / command

           idx:   1      2      3      4      5
        Leader [ 1|x=3 ][ 1|y=1 ][ 2|x=9 ][ 3|z=7 ][ 3|x=0 ]
   Follower A  [ 1|x=3 ][ 1|y=1 ][ 2|x=9 ][ 3|z=7 ][ 3|x=0 ]   (in sync)
   Follower B  [ 1|x=3 ][ 1|y=1 ][ 2|x=9 ]                      (lagging by 2)
   Follower C  [ 1|x=3 ][ 1|y=1 ][ 2|x=9 ][ 3|z=7 ][ 3|x=0 ]
   Follower D  [ 1|x=3 ][ 1|y=1 ]                               (lagging by 3)

   "1|x=3" means: created in TERM 1, command is x=3, sitting at INDEX 1.
   The number before the bar is the TERM, not the value.
Analogy: Think of the log as a shared notebook that a team is copying by hand. The leader is the one person allowed to write a new line. Everyone else copies the leader's notebook line for line. The trouble is that copiers crash, lose pages, or copy stale versions — so the leader needs a disciplined protocol to detect "your page 4 doesn't match mine" and fix it without ever tearing out a line that the team has already officially agreed on.

4.1 The AppendEntries RPC in detail

An RPC ("remote procedure call") is just a message one server sends another that says "run this and tell me the result." Raft uses only two RPCs. You met RequestVote in the election section. The workhorse here is AppendEntries. The leader sends it to every follower both to copy new entries and, when there are no new entries, as a heartbeat (an empty AppendEntries that says "I'm still leader, stay quiet").

Here are the fields the leader puts in each AppendEntries request:

FieldPlain meaning
termThe leader's current term. Lets the follower detect a stale (old) leader.
leaderIdWho the leader is, so a follower can redirect clients to it.
prevLogIndexThe index of the entry immediately before the new ones. This is the "join point."
prevLogTermThe term of that entry at prevLogIndex. Together with prevLogIndex it is a fingerprint of the follower's log up to that point.
entries[]The new entries to store (empty for a heartbeat).
leaderCommitThe leader's commitIndex — tells the follower how far it is safe to apply.

The reply is tiny: the follower's own term (so the leader can step down if it's behind) and a boolean success.

When a follower receives AppendEntries, it runs these steps in order:

  1. Term check. If the request's term is older than the follower's own term, reply success = false — this is a stale leader; ignore it. Otherwise accept this leader and reset the election timeout (so it won't start its own election).
  2. Consistency check. Look at the follower's own entry at prevLogIndex. If the follower has no entry there, or it has one but its term ≠ prevLogTerm, reply success = false. The logs disagree at the join point, so appending now would create a hole or a mismatch.
  3. Repair conflicts. If an existing entry conflicts with a new one (same index, different term), delete that entry and everything after it. The leader's copy always wins.
  4. Append. Add any new entries not already present.
  5. Advance commit. If leaderCommit > the follower's commitIndex, set commitIndex = min(leaderCommit, index of last new entry), then apply newly-committed entries to the state machine.
  6. Reply success = true.

The Log Matching Property — why the consistency check is enough

Step 2 looks almost too simple — it only checks one entry, the one at prevLogIndex. How can checking a single entry guarantee the whole log up to that point matches? Because of an invariant Raft maintains called the Log Matching Property:

If two logs contain an entry with the same index and the same term, then (a) they store the same command at that index, and (b) the logs are identical in every entry before it.

Part (a) holds because a leader creates at most one entry per index per term, and entries never move once created. Part (b) holds by induction, and the AppendEntries consistency check is exactly what enforces it: a follower only accepts new entries if its entry at prevLogIndex already matches the leader's term there. So matching at the join point implies matching all the way back. This is the load-bearing property of the whole protocol — it turns "are these two long logs identical?" into a single cheap comparison.

Key takeaway: prevLogIndex + prevLogTerm act as a checksum of everything before the new entries. If they match, the entire prefix matches — that is the Log Matching Property, and it lets Raft verify and repair logs one entry at a time instead of comparing the whole thing.

4.2 Consistency check & repair — fixing a divergent follower

After a crash, a follower's log can be wrong in two ways: it can be missing entries the leader has, or it can have extra entries (from a previous leader) that were never committed and must be thrown away. Raft fixes both with one mechanism.

The leader keeps, for each follower, a number called nextIndex: "the next log index I will try to send you." When a leader is first elected, it optimistically sets every follower's nextIndex to its own last index + 1. It then sends AppendEntries with prevLogIndex = nextIndex − 1. If the follower rejects (the consistency check fails), the leader decrements nextIndex and retries. It keeps backing up until it finds the last index where the two logs agree. From that point forward, its entries flow in, and the follower's step 3 deletes any conflicting tail. Eventually the follower's log is byte-for-byte the leader's.

LOG REPAIR: leader backs up nextIndex until logs agree, then overwrites the tail.

                 idx: 1    2    3    4    5    6    7
   Leader (term 4) [ 1 ][ 1 ][ 1 ][ 4 ][ 4 ][ 4 ][ 4 ]   (numbers = TERM of entry)
   Follower X      [ 1 ][ 1 ][ 1 ][ 2 ][ 2 ][ 3 ]
                                    ^^^^^^^^^^^^^^
                          extra, UNCOMMITTED entries from old terms 2 & 3

   Leader starts: nextIndex(X) = 7  -> AppendEntries prevLogIndex=6,prevLogTerm=4
     Follower X@6 has term 3, not 4  -> success=false. Leader: nextIndex 7->6
   prevLogIndex=5,prevLogTerm=4 : X@5 has term 2  -> false. nextIndex 6->5
   prevLogIndex=4,prevLogTerm=4 : X@4 has term 2  -> false. nextIndex 5->4
   prevLogIndex=3,prevLogTerm=1 : X@3 has term 1  -> MATCH! success=true.
     Leader now sends entries[4..7] = (4,4,4,4).
     Follower deletes its old 4,5,6 (terms 2,2,3) and writes the leader's.

   Result -> Follower X [ 1 ][ 1 ][ 1 ][ 4 ][ 4 ][ 4 ][ 4 ]   identical.
Example (missing entries, the easy case): Follower D has only indexes 1–2; the leader has 1–5. The first AppendEntries (prevLogIndex=4) is rejected because D has nothing at index 4. The leader backs up to prevLogIndex=2, which D does have and which matches, so it ships entries 3, 4, 5. No deletion needed — D was simply behind. The "extra entries" case (Follower X above) is the interesting one, because there Raft must discard committed-looking-but-actually-uncommitted work. That's safe precisely because those entries were never committed (never on a majority), so no client was ever told they succeeded.
Best practice: Decrementing nextIndex by one per round-trip is correct but slow when a follower is far behind (one RPC per missing entry). Production systems use the paper's optimization: the rejected follower returns the term of the conflicting entry and the first index it has for that term, letting the leader skip a whole term's worth of indexes in one jump. etcd and others implement this. Don't ship the naive one-at-a-time version under load.

4.3 Commitment — the majority rule and the Figure-8 subtlety

An entry is committed the moment it is safely stored on a majority of servers. Once committed, it is durable: even if any minority of servers crash, every future leader will still contain it (you'll see why in §4.4). After committing, the leader advances its commitIndex, applies the entry, replies to the client, and piggybacks the new commitIndex on the next AppendEntries via leaderCommit so followers apply it too.

COMMIT BY MAJORITY (5 servers, majority = 3). Leader appends entry @ index 5.

   t0  Client --"set x=0"--> Leader
   t1  Leader appends [3|x=0]@5 to its OWN log         (stored on 1 of 5)
   t2  Leader --AppendEntries(entries=[3|x=0])--> A,B,C,D
          A: stored, ack   -->  2 of 5
          B: stored, ack   -->  3 of 5  *** MAJORITY reached ***
   t3  Leader sees 3/5 -> commitIndex = 5 -> APPLY x=0 -> reply "OK" to client
   t4  (C and D ack later; they catch up via leaderCommit on next heartbeat)

   The entry is COMMITTED at t2 even though C and D haven't stored it yet.
   Durability comes from the majority overlap, not from "everyone".

Why a leader must not commit an old-term entry just by counting replicas (the Figure-8 problem)

Here is the famous subtlety. Suppose a new leader finds an entry from an earlier term sitting in its log, replicated on a majority. It is tempting to say "it's on a majority, therefore it's committed." This is wrong, and it can lose committed data. The danger: an entry can be on a majority of servers at one moment, then a different server with a shorter log can still win a future election and overwrite that entry — because elections compare logs by term first, not by length (see §4.4). So "on a majority right now" does not by itself mean "permanent."

Raft's fix is a rule called the commitment restriction:

A leader only counts replicas to commit entries from its own current term. Entries from earlier terms are committed only indirectly — once an entry from the current term that sits after them gets committed, the Log Matching Property drags the older entries along with it.

So a fresh leader does not "re-commit" what it inherited. It commits something new in its own term; the moment that new entry is on a majority and committed, everything before it is committed too, safely.

Example (Figure 8, simplified): Entry E was created by leader S1 in term 2 and copied to a majority — but not yet committed. S1 crashes. S5 (which never had E) wins term 3 because its log isn't behind on term. S5 crashes; S1 comes back as leader in term 4 and finishes spreading E to all. If S1 were allowed to call E "committed" just because it's now on a majority, disaster looms: S5 could come back, win again, and overwrite E — erasing an entry we'd called committed. Raft forbids this by never committing E (a term-2 entry) via replica-counting in term 4. E only becomes committed when S1 commits a fresh term-4 entry after it; at that point S5 can no longer win, so E is truly safe.
Common mistake: Believing "stored on a majority = committed, always." It is only true for an entry created in the current leader's term. For inherited older-term entries, on-a-majority is necessary but not sufficient — they become safe only once a current-term entry on top of them commits.

4.4 Safety — how Raft guarantees committed entries are never lost

Everything above only works if Raft can promise one thing: a committed entry is never lost, never overwritten, and never appears at a different index later. That promise is the State Machine Safety Property: if any server has applied an entry at a given index, no other server will ever apply a different entry at that index. Raft reaches it through two smaller guarantees.

(1) The Election Restriction — only an up-to-date candidate can win

When a candidate asks for a vote via RequestVote, it includes its lastLogIndex and lastLogTerm. A follower grants its vote only if the candidate's log is at least as up-to-date as its own. "Up-to-date" is compared like this:

  1. Compare the last log term of both. The higher term wins — it is more up-to-date.
  2. If the last terms are equal, the longer log (higher last index) wins.

Because a candidate needs a majority of votes, and any majority overlaps the majority that stored a committed entry, at least one voter has that committed entry. That voter will refuse to vote for any candidate whose log is missing it (such a candidate would be "less up-to-date"). Therefore a candidate that lacks a committed entry can never assemble a majority. It cannot win.

Analogy: Imagine electing a new editor for the team notebook, and the rule is "you may only vote for a candidate whose notebook is at least as complete as yours." Any officially-agreed line is in a majority of notebooks. To win, a candidate needs a majority of votes — but at least one of those voters holds every agreed line, and that person won't endorse anyone missing one. So the new editor always already has every agreed line. Nothing agreed can be lost across a handover.

(2) Leader Completeness → State Machine Safety

The Election Restriction gives the Leader Completeness Property: every entry committed in some term is present in the logs of all leaders of higher terms. In plain words: once an entry is committed, every future leader already has it. Combine that with the rule that a leader never deletes or overwrites its own entries (it only appends), and you get the conclusion: a committed entry sits at the same index in every leader from now on, gets copied to all followers, and is applied identically everywhere. That is exactly State Machine Safety.

Key takeaway: The safety argument is one clean chain — majority overlap means at least one voter holds each committed entry; the Election Restriction means that voter blocks any leader missing it; so every future leader is complete; and append-only logs mean a committed entry can never be displaced. Committed = permanent.

4.5 Cluster membership changes

Real clusters grow, shrink, and replace dead machines. The configuration is the set of servers that count toward a majority. Changing it is dangerous because for a brief moment the cluster might disagree about who the members are, and that can produce two disjoint majorities electing two leaders at once — a split brain, the exact thing consensus exists to prevent.

WHY A NAIVE SWITCH IS UNSAFE: growing from 3 servers to 5 in one step.

   Old config C-old = {S1,S2,S3}     majority = 2
   New config C-new = {S1,S2,S3,S4,S5} majority = 3

   If servers adopt the new config at different times, you can get:
       S1,S2  still think config is C-old  -> {S1,S2} = majority of C-old
       S3,S4,S5 think config is C-new      -> {S3,S4,S5} = majority of C-new

   TWO majorities, no overlap  ->  S1 and S3 can BOTH win elections
                                ->  TWO leaders in the same term. Split brain.

Joint consensus (C-old,new) — the general, safe method

Raft's original answer is a two-phase switch through a transitional combined configuration written C-old,new. The trick: while in this transitional state, a decision (winning an election, committing an entry) requires separate majorities from both the old set and the new set — not one merged majority. This makes overlapping majorities impossible, so no two leaders can co-exist. The steps:

  1. The leader appends a special configuration entry C-old,new to its log and replicates it. The instant a server stores this entry (even before it's committed) it starts using it for all future decisions.
  2. While C-old,new is in force, every election and every commit needs a majority of C-old and a majority of C-new. Neither old nor new can act alone.
  3. Once C-old,new is committed, the leader appends a final C-new entry. After that commits, the transition is done; servers not in C-new shut down.

The overlap at each hop (old↔joint, joint↔new) guarantees a single decision-maker throughout. Clients keep being served during the change — there is no downtime.

Single-server-at-a-time — the simpler method

Joint consensus is fiddly to implement, so Ongaro's dissertation describes a simpler approach that most systems use: add or remove only one server per configuration change. Adding or removing a single server cannot create two disjoint majorities — the old and new majorities are guaranteed to overlap (the math: changing the count by one can't split the cluster into two non-overlapping majority halves). To go from 3 to 5, you do two single-server changes in sequence. This is what etcd and Consul use in practice.

Best practice: When adding a brand-new, empty server, bring it in as a non-voting learner first. It catches up on the log via AppendEntries without counting toward any majority. Only promote it to a full voter once its log is current. Otherwise a slow, empty new member can stall commits (the leader can't reach a majority that includes a member with nothing) or even trigger spurious leader changes. etcd's "learner" and Consul's non-voting members exist for exactly this.
Analogy: Changing a board of directors mid-meeting. If half the room thinks Alice was just removed and half thinks she's still voting, you can ratify two contradictory motions. Joint consensus is like requiring, during the handover, that a motion pass under both the old bylaws and the new ones simultaneously — so it's impossible to slip two conflicting decisions through. The single-at-a-time method is the conservative rule "only seat or unseat one director per meeting," which never produces two competing majorities in the first place.

4.6 Log compaction & snapshotting

The log grows forever — every command is a new entry. Left unchecked it would fill the disk and make restarts (which replay the whole log) unbearably slow. The fix is a snapshot: a compact dump of the state machine's current state at a point in time. Once you have a snapshot up to index N, you can throw away log entries 1…N, because their combined effect is already captured in the snapshot.

Each server snapshots independently. A snapshot records the state plus two pieces of bookkeeping: lastIncludedIndex (the last log index it covers) and lastIncludedTerm (that entry's term) — kept so the Log Matching consistency check still works at the boundary after the entries are gone.

This creates a corner case: what if a follower is so far behind that the leader has already discarded the entries that follower still needs? The leader can't send entries it no longer has. So Raft adds a third RPC, InstallSnapshot: the leader ships its whole snapshot to the lagging follower, who installs it wholesale (discarding its own now-superseded log) and resumes normal AppendEntries from the snapshot's end. Its fields: term, leaderId, lastIncludedIndex, lastIncludedTerm, offset + data (the snapshot bytes, sent in chunks for large states), and done (last chunk flag).

Example: A 5-server cluster has processed a million commands. Each server periodically snapshots — say at index 950,000 — and deletes entries 1…950,000, keeping only the recent tail. A follower that was offline for an hour comes back asking for index 200,000, which no longer exists anywhere. The leader sends an InstallSnapshot with lastIncludedIndex = 950,000; the follower replaces its stale log with that snapshot in one shot, then catches up the last few thousand entries normally — far faster than replaying a million.
Common mistake: Treating snapshotting as an optional add-on. Without it, a long-running Raft cluster's restart time and disk use grow without bound, and a briefly-offline follower can take a very long time to rejoin by re-streaming the entire history. Snapshotting is mandatory for any production deployment, not a nice-to-have.

Common mistakes & misconceptions

Common mistake: Thinking the consistency check compares whole logs. It checks a single entry at prevLogIndex. The Log Matching Property is what makes that one check imply the entire prefix matches.
Common mistake: Assuming "replicated on a majority" always means "committed." It only does for entries in the leader's current term. Counting replicas to commit an older-term entry directly is the Figure-8 bug and can lose data.
Common mistake: Believing the longest log always wins an election. Raft compares the last term first, and only uses length as a tie-breaker when last terms are equal. A shorter log with a higher last term is "more up-to-date."
Common mistake: Changing cluster membership by just editing a config file and restarting nodes. Without joint consensus or single-at-a-time changes, you can create two disjoint majorities and elect two leaders (split brain).
Common mistake: Thinking a follower's discarded "extra" entries were a bug or data loss. Those uncommitted entries were never on a majority and no client was ever told they succeeded, so deleting them is correct and loses nothing the system promised.

Best practices

Best practice: Use the fast log-backtracking optimization (return conflicting term + first index of that term on a rejection) instead of decrementing nextIndex by one per RPC. It turns slow per-entry repair into per-term jumps.
Best practice: Add new members as non-voting learners and let them catch up before promoting them to voters, so an empty newcomer never stalls commits or destabilizes elections.
Best practice: Change membership one server at a time (or use proper joint consensus). Never do a multi-server swap in a single step.
Best practice: Snapshot regularly and tune the trigger (e.g. every N entries or M bytes). Keep lastIncludedIndex/lastIncludedTerm so the boundary consistency check still works, and stream large snapshots in chunks.
Best practice: Persist currentTerm, the vote, and the log to stable storage before responding to any RPC. Acknowledging an entry you haven't durably stored breaks the majority-durability guarantee on a crash.

Section summary

  • AppendEntries carries prevLogIndex/prevLogTerm as a fingerprint; a follower accepts new entries only if that fingerprint matches, which enforces the Log Matching Property (same index+term ⇒ identical logs up to there).
  • A divergent follower is repaired by the leader backing up nextIndex until the logs agree, then overwriting the follower's conflicting tail with its own entries.
  • An entry is committed once stored on a majority; the leader then advances commitIndex and tells followers via leaderCommit.
  • A leader commits by replica-count only for entries in its own term; older entries commit indirectly once a current-term entry on top of them commits — this is the Figure-8 fix.
  • The Election Restriction (vote only for an at-least-as-up-to-date log, compared by last term then last index) plus majority overlap yields Leader Completeness and therefore State Machine Safety: committed entries are never lost or reordered.
  • Naive membership changes can create two majorities and split brain; Raft prevents this with joint consensus (C-old,new) or the simpler single-server-at-a-time method.
  • Logs can't grow forever; snapshots compact committed state, and InstallSnapshot brings a far-behind follower up to date when needed entries have been discarded.
  • Across the protocol, "committed" is an absolute promise — once an entry is committed, every future leader has it and every server eventually applies it at the same index.

Continue reading