Raft — Leader Election
Raft is an algorithm for keeping a replicated log in agreement across a cluster of servers. A "replicated log" is just an ordered list of commands (like "set x = 3", then "set y = 9") that every server must store in the same order, so that every server ends up in the same state. The hard part is agreeing on that order when machines crash, restart, and the network drops or delays messages. That agreement problem is called consensus.
Raft was created by Diego Ongaro and John Ousterhout (Stanford, 2014) with one explicit, unusual goal: understandability. Earlier algorithms (Paxos) are famously hard to reason about, so Raft deliberately breaks the problem into three smaller, mostly independent pieces:
- Leader election — pick exactly one server to be in charge. (This section.)
- Log replication — the leader copies its log to everyone. (Later section.)
- Safety — guarantee the leader can never overwrite already-agreed history. (Later section.)
Raft's central design choice is the strong leader: at any moment there is one leader, and all changes to the log flow only from the leader outward to the others. Followers never talk to each other about log content. This single rule removes a huge amount of complexity — but it means everything depends on reliably electing a leader, and re-electing one the instant the old leader dies. That is what this section is about.
3.1 The three server states
At every moment, each Raft server is in exactly one of three states (a "state" here just means "the role this server is currently playing"):
| State | What it does | How many at once |
|---|---|---|
| Follower | Passive. Issues no requests on its own. Only responds to messages from leaders and candidates. Redirects clients to the leader. | Usually all-but-one |
| Candidate | A follower that has decided to try to become leader. It is asking the others to vote for it. | Zero or more (briefly) |
| Leader | Handles all client requests, owns the log, and continuously tells followers it is alive. | At most one per term |
In normal, healthy operation there is exactly one leader and every other server is a follower. The candidate state exists only during the brief, occasional moment of electing a new leader.
The state machine (transitions)
A server moves between these states based on what it hears (or fails to hear) from the network. Here is the full state diagram — this is the heart of leader election:
times out, times out,
starts up starts election new election
| +-----+ +-----+
v | | | |
+--------+ no heartbeat| v | v wins votes from
| | (election | +-----------+ | +-----------+ majority
| FOLLOWER|--timeout)---->| | CANDIDATE |---+->| CANDIDATE |------------+
| | | +-----------+ +-----------+ |
+--------+ |
^ ^ v
| | +--------+
| | discovers current leader, | |
| | or a server/RPC with a HIGHER term | LEADER |
| +--------------------------------------------------------| |
| +--------+
| discovers a server with a higher term |
+----------------------------------------------------------------+
Read the transitions as plain rules:
- Follower → Candidate: a follower hears nothing from any leader for too long (its election timeout fires), so it starts an election.
- Candidate → Leader: the candidate collects votes from a majority of the cluster.
- Candidate → Follower: the candidate discovers a legitimate current leader (or any message carrying a higher term — defined below), so it gives up and steps down.
- Candidate → Candidate: the election timed out with no winner (a split vote), so the candidate starts a fresh election.
- Leader → Follower: the leader learns of a higher term than its own, meaning a newer election has happened, so it steps down immediately.
3.2 Terms — Raft's logical clock
Before we can run an election, we need the concept of a term.
What it is: a term is a numbered period of time. Terms are numbered with consecutive integers: term 1, term 2, term 3, and so on. Each term begins with an election. If the election produces a winner, that server is leader for the rest of that term. If the election fails to produce a winner, the term ends with no leader and the next term (with a new election) begins.
The problem it solves: there is no global clock in a distributed system (you already know this). Servers can't agree on wall-clock time. But they can agree on "which election are we on?" The term number is a logical clock — a counter that only moves forward — that lets every server detect stale information. A message from an old term is obviously out of date.
term 1 term 2 t3 term 4
+--------+ +-----------------+ +------+ +-------------+
|election| |election normal | |elect.| |elect. normal|
| -> op | | -> oper. | |(none)| | -> oper. |
+--------+ +-----------------+ +------+ +-------------+
^
|
split vote: term ended
with NO emerging leader
---------------------------------------------------> time (terms)
Notice term 3 (t3) is short and has no leader — that's an election that failed (a split vote). Raft handles this gracefully: it just rolls into term 4 and tries again.
Two rules make terms powerful:
- At most one leader per term (the Election Safety property). A term might have zero leaders (failed election) but never two. We'll prove why below.
- Every message carries the sender's term number. Every server stores its own
currentTerm. When two servers communicate, they compare terms:- If a server sees a term larger than its own
currentTerm, it updatescurrentTermto the larger value and, if it was a candidate or leader, immediately reverts to follower. (It has been left behind by a newer election.) - If a server receives a request with a stale (smaller) term, it rejects the request.
- If a server sees a term larger than its own
3.3 Heartbeats — how the leader stays in power
How do followers know the leader is still alive? Through heartbeats.
The leader periodically sends an AppendEntries RPC to every follower. (An RPC, "remote procedure call", is just a message that asks another server to run something and reply.) AppendEntries is the same RPC the leader uses to replicate log entries — but when it carries no log entries (an empty entries[] list), it functions purely as a heartbeat: a "I'm still here, stay calm" signal.
Each time a follower receives a valid AppendEntries from the current leader, it resets its election timer. As long as heartbeats keep arriving, no follower ever times out, so no follower ever starts an election, so the leader keeps its job. The leader sends heartbeats even when idle (no client requests), precisely to prevent followers from timing out.
Leader Follower A Follower B Follower C
| | | |
|--AppendEntries--->| | | (empty = heartbeat)
|--AppendEntries--------------------->| |
|--AppendEntries-------------------------------------> |
| | reset timer | reset timer | reset timer
| ...wait... | | |
|--AppendEntries--->|--------------------->|----------->| (next heartbeat)
| | reset timer | reset timer | reset timer
3.4 The election timeout — what kicks off an election
The election timeout is the amount of time a follower waits, hearing nothing from a leader, before it gives up and tries to become leader itself.
If a follower's election timeout elapses without receiving any AppendEntries from a current leader (and without granting a vote to a candidate — more on that soon), it assumes there is no viable leader and begins an election. To begin an election the follower does four things, in order:
- Transitions to candidate state.
- Increments its
currentTerm(e.g. from 2 to 3) — starting a brand-new term. - Votes for itself (sets its
votedForto its own id, and counts one vote). - Resets its election timer and sends
RequestVoteRPCs, in parallel, to every other server.
The RequestVote RPC — exact fields
RequestVote is the RPC a candidate sends to ask "please vote for me as leader." Here are its exact arguments and results, straight from the Raft specification:
| Direction | Field | Meaning (plain words) |
|---|---|---|
| Argument | term | The candidate's term number (the new, incremented one). |
| Argument | candidateId | Which server is requesting the vote (so the voter can record it). |
| Argument | lastLogIndex | The position (index) of the last entry in the candidate's log — how long its log is. |
| Argument | lastLogTerm | The term of that last log entry — how recent its log is. |
| Result | term | The voter's currentTerm, so a behind candidate can update itself. |
| Result | voteGranted | true means "yes, you have my vote." |
The last two arguments — lastLogIndex and lastLogTerm — exist for the safety restriction we preview in §3.7. For now, just note that the candidate advertises how up-to-date its log is.
3.5 Winning the election — majority and at-most-one-vote
A candidate wins the election if it receives voteGranted = true from a majority of all servers in the cluster (not just a majority of those that replied). "Majority" means more than half. In a 5-server cluster, a majority is 3.
The voting rule on each server is simple and strict:
- Each server may vote for at most one candidate per term, on a first-come-first-served basis. It records its choice in
votedFor. Once it has voted in term T, it will refuse any other candidate in term T. - A server rejects a
RequestVotewhosetermis smaller than its owncurrentTerm(stale candidate). - (Plus the up-to-date-log check from §3.7.)
When a candidate gathers a majority, it immediately becomes leader and starts sending heartbeats to assert its authority and stop any other elections from starting.
Why "at most one leader per term" is guaranteed
This is the elegant part. Suppose, for contradiction, two different candidates both won the same term. Each would need votes from a majority of the cluster. But any two majorities of the same set must overlap in at least one server (two groups each bigger than half can't be disjoint). That overlapping server would have had to vote twice in the same term — which the "at most one vote per term" rule forbids. Contradiction. Therefore at most one candidate can win a given term. This is the Election Safety property.
5 servers: S1 S2 S3 S4 S5
Candidate A's majority: [ S1 S2 S3 ]
Candidate B's majority: [ S3 S4 S5 ]
^^
|
S3 is in BOTH — it cannot have
voted for both A and B in one term.
=> one of them cannot reach a majority.
A full election timeline
S1(Leader,term2) S2(Follower) S3(Follower) S4(Follower) S5(Follower)
| | | | |
|--heartbeat----->|------------->|------------->|------------->|
X <-- S1 CRASHES | | | |
| ...no heartbeats arriving... |
| timer fires (S2 first) |
| term 2 -> 3, vote self, become CANDIDATE |
|--RequestVote(term=3, id=S2)--------------->| (to all)
| | | |
|<--voteGranted--|<--granted----|<--granted----|
| got votes: S2(self)+S3+S4 = 3 of 5 = MAJORITY
| become LEADER of term 3
|--heartbeat(term=3)----------->...---------->|
| everyone resets timers, recognizes S2
Sequence: leader dies → a follower times out → it becomes a candidate in a new term → it sends RequestVote to all → it collects a majority → it becomes the new leader → its heartbeats restore calm. Total time is typically a few hundred milliseconds.
3.6 Split votes and randomized timeouts
There is a third possible outcome of an election: the candidate neither wins nor loses. This is a split vote.
How it happens: if several followers time out at nearly the same instant, they all become candidates in the same new term and all start begging for votes simultaneously. The votes get divided among them, and no single candidate reaches a majority. Each candidate then sits, having voted for itself, unable to win and unwilling to vote for anyone else this term.
5 servers. S2 and S4 time out at the SAME time -> both candidates, term 3.
S1 S2(cand) S3 S4(cand) S5
| <--ReqVote-- | | --ReqVote-->| |
| | --ReqVote--> | | <--ReqVote--|... (crossing)
Votes land: S2 gets: S2(self) + S1 = 2
S4 gets: S4(self) + S5 = 2
S3 already voted for S2? or S4? say it voted S4 -> S4=3?
(In the classic split:) S2 = S1,S2 | S4 = S4,S5 | S3 hasn't decided
-> 2 vs 2, nobody reaches 3. *** SPLIT VOTE ***
-> term 3 ends with NO leader. Everyone waits, times out again.
Without a countermeasure, split votes could repeat forever: every candidate times out at the same time again, splits again, term after term. Raft's fix is beautifully simple — randomized election timeouts.
Instead of every server using the same fixed timeout, each server picks its election timeout randomly from a fixed range — the paper uses 150–300 ms as an example. So the timeouts are spread out. In most cases, one server times out clearly before any other, wins the whole election uncontested, and sends heartbeats before anyone else even wakes up. The same randomization is reused after a split: each candidate restarts its timer with a fresh random value before the next attempt, so the next round is very unlikely to tie again.
Split happened in term 3. Now each picks a NEW random timeout for term 4:
S2 random timeout = 270ms |------------------------------|fires
S4 random timeout = 165ms |----------------|fires <-- WINS THE RACE
|
S4 times out first, becomes candidate (term 4),
collects a majority, and sends heartbeats
BEFORE S2's 270ms is up -> S2 hears S4's heartbeat,
resets its timer, stays a follower. No second split.
---------------------------------------------> time
This is a probabilistic guarantee, not a hard one: a tie is possible in any single round, but its probability shrinks rapidly because two independent random picks rarely match closely, and each retry re-rolls the dice. In practice elections resolve in one or two rounds.
3.7 The "up-to-date log" voting restriction (preview)
So far a follower grants its vote to the first candidate that asks in a term. But there is one more condition, and it's the bridge to Raft's safety guarantees. A server will grant its vote only if the candidate's log is at least as up-to-date as its own log.
The problem it solves: the leader is the single source of truth for the log, and Raft never lets a leader overwrite committed history. So we must never elect a leader that is missing entries other servers have already committed — otherwise the new leader's incomplete log would become the official one, and agreed-upon commands would vanish. The vote is the gate that keeps a behind server from ever becoming leader.
How "up-to-date" is compared (this is exactly why lastLogTerm and lastLogIndex are in RequestVote):
- Compare the term of the last entry first. The log whose last entry has the higher
lastLogTermis more up-to-date. - If the last terms are equal, then the longer log (higher
lastLogIndex) is more up-to-date.
A voter grants its vote only when the candidate's (lastLogTerm, lastLogIndex) is greater than or equal to the voter's own. If the candidate is behind on either measure, the voter says voteGranted = false.
- Candidate A: last entry index 6, term 4. Same last term (4), longer log (6 ≥ 5) → at least as up-to-date → S3 grants the vote.
- Candidate B: last entry index 8, term 3. Longer log (8), but its last term (3) is lower than S3's (4). Term is compared first, so B is behind despite being longer → S3 refuses. B has a longer-but-older log and must not lead.
We only preview this here. Why this exact comparison is sufficient to guarantee that a newly elected leader holds every committed entry — the Leader Completeness property — is the subject of the safety section. For now, hold onto this: elections aren't just popularity contests; a candidate must also prove its log is current before anyone will vote for it.
RequestVote. Majority voting decides who leads; the log comparison decides who is even eligible to lead. Together they ensure the winner never destroys committed data.
Common mistakes & misconceptions
lastLogTerm/lastLogIndex check, even if it asked first. Eligibility comes before popularity.
Best practices
broadcastTime ≪ electionTimeout ≪ MTBF (mean time between failures). The paper's 150–300 ms suits a fast LAN; use larger values for cross-datacenter links so transient slowness doesn't trigger needless elections.
currentTerm and votedFor to stable storage before responding to any RPC. If a server crashes and restarts within a term, it must remember it already voted, or it could vote a second time and break the at-most-one-leader guarantee.
Section summary
- Every Raft server is a Follower, Candidate, or Leader; normal operation has exactly one leader and the rest followers.
- Terms are an ever-increasing logical clock; each term starts with an election and has at most one leader. Every message carries a term, and a higher term forces stale leaders/candidates back to follower.
- The leader keeps power by broadcasting heartbeats (empty
AppendEntries); receiving one resets a follower's election timer. - A follower whose election timeout elapses becomes a candidate: it increments its term, votes for itself, and sends
RequestVote(term, candidateId, lastLogIndex, lastLogTerm)to all servers. - A candidate wins by collecting votes from a majority of the whole cluster; each server votes once per term, first-come, which (via quorum overlap) guarantees at most one winner.
- Split votes happen when servers time out together; randomized timeouts (e.g. 150–300 ms) de-synchronize them so one candidate usually wins quickly and repeated splits are rare.
- An election can end with no leader; Raft just starts a new term and retries.
- The up-to-date-log restriction — comparing
lastLogTermthenlastLogIndex— means a candidate whose log is behind cannot win, linking election to safety and setting up the Leader Completeness guarantee.