Revision Cheat Sheet
By Pritesh Yadav 4 min read —
One-line everything: Consensus = agree on one ordered log of commands, using a majority quorum, so identical replicated state machines stay in lockstep despite crashes and partitions.
The Consensus Requirements
- Agreement — no two correct nodes decide differently (safety).
- Validity — the decided value was actually proposed by someone (safety).
- Termination — every correct node eventually decides (liveness).
- FLP — in a fully async network with crashes, no deterministic algorithm guarantees all three; real systems use timeouts + randomness to get practical liveness, never sacrificing safety.
Quorum / Majority Rule
- Quorum = strict majority = ⌊n/2⌋ + 1.
- Tolerate f failures → need 2f+1 nodes (3 tolerate 1, 5 tolerate 2).
- Safety comes from overlap: any two majorities share ≥1 node, so two conflicting decisions can't both win.
- Use odd node counts — an even extra node raises the threshold without adding fault tolerance.
Raft — Server States
| State | Role |
|---|---|
| Follower | Passive; responds to RPCs; starts election if heartbeat times out. |
| Candidate | Bumps term, votes for self, requests votes; wins on majority. |
| Leader | Handles all client writes; sends heartbeats; replicates the log. |
Raft — The Two RPCs
| RPC | Sent by | Purpose | Granted/accepted when |
|---|---|---|---|
| RequestVote | Candidate | Win an election | Voter hasn't voted this term AND candidate's log is ≥ as up-to-date. |
| AppendEntries | Leader | Replicate entries / heartbeat (empty) | Term is current AND prev entry's index+term match (log-matching check). |
Raft — Key Properties
- Election safety — at most one leader per term.
- Log matching — same index+term ⇒ logs identical up to that point.
- Leader completeness — a committed entry exists in every future leader's log.
- Committed = replicated to a majority by the current leader ⇒ durable, never lost.
- Membership changes via joint consensus (old+new majorities both required) to avoid split-brain; snapshots compact the log and catch up laggards.
Paxos — Two Phases (Single-Decree)
| Phase | Proposer asks | Acceptor does |
|---|---|---|
| 1: Prepare / Promise | "Promise to ignore ballots < N" | Promises; returns any value already accepted. |
| 2: Accept / Accepted | "Accept value V at ballot N" | Accepts if it hasn't promised a higher N. Majority accept ⇒ chosen. |
Paxos safety rule: if any acceptor already accepted a value, a new proposer must re-propose that same value (the highest-ballot one it learns about in Phase 1) — this is what prevents two different values being chosen.
- Roles: Proposer (drives), Acceptor (votes), Learner (finds out the result).
- Multi-Paxos: elect a stable leader once, skip Phase 1 for later slots ⇒ steady-state commit = one round trip (just like Raft).
Raft vs Paxos — One-Liners
- Same problem, same majority foundation; both reduce to agreeing on a log.
- Raft = prescriptive, single strong leader, structured log, built for understandability.
- Paxos = minimal, general, leaderless in theory; leaves leader election + multi-decision as DIY.
- Multi-Paxos and Raft converge on "stable leader + replicated log" in practice.
- Term (Raft) = Epoch = Ballot (Paxos) — the increasing number that exposes stale leaders.
- Both are crash-fault tolerant only — neither handles Byzantine/malicious nodes (use PBFT for that).
Real-World Systems
| System | Algorithm | Used for |
|---|---|---|
| etcd | Raft | Kubernetes' config/state store. |
| Consul | Raft | Service discovery, config, locks. |
| CockroachDB / TiKV | Raft (per shard) | Distributed SQL/KV replication. |
| ZooKeeper | ZAB (Raft-like) | Coordination, locks, leader election. |
| Google Spanner / Chubby | (Multi-)Paxos | Globally-distributed DB; lock service. |
| PBFT / blockchains | BFT consensus | Agreement among untrusted/malicious nodes. |
Rules of Thumb
Use 3 nodes for "survive one failure," 5 for "survive two / maintenance + a failure." Beyond 7, write throughput suffers — shard into multiple groups instead.
Don't roll your own consensus. Use a proven Raft library (etcd/Hashicorp). Subtle bugs in election or commit logic cause silent data loss that tests rarely catch.
Every decision costs at least one majority round trip — budget that latency. Cross-datacenter clusters pay WAN latency on every write; keep the quorum geographically tight or accept the cost.
- Partition ⇒ only the majority side makes progress; the minority pauses (consistency over availability).
- Scale reads with follower reads + leader leases; scale writes by sharding into many consensus groups.
- Cap the log with snapshots; bootstrap new nodes from a snapshot, not full replay.