Revision Cheat Sheet

By Pritesh Yadav 4 min read
One-line everything: Consensus = agree on one ordered log of commands, using a majority quorum, so identical replicated state machines stay in lockstep despite crashes and partitions.

The Consensus Requirements

  • Agreement — no two correct nodes decide differently (safety).
  • Validity — the decided value was actually proposed by someone (safety).
  • Termination — every correct node eventually decides (liveness).
  • FLP — in a fully async network with crashes, no deterministic algorithm guarantees all three; real systems use timeouts + randomness to get practical liveness, never sacrificing safety.

Quorum / Majority Rule

  • Quorum = strict majority = ⌊n/2⌋ + 1.
  • Tolerate f failures → need 2f+1 nodes (3 tolerate 1, 5 tolerate 2).
  • Safety comes from overlap: any two majorities share ≥1 node, so two conflicting decisions can't both win.
  • Use odd node counts — an even extra node raises the threshold without adding fault tolerance.

Raft — Server States

StateRole
FollowerPassive; responds to RPCs; starts election if heartbeat times out.
CandidateBumps term, votes for self, requests votes; wins on majority.
LeaderHandles all client writes; sends heartbeats; replicates the log.

Raft — The Two RPCs

RPCSent byPurposeGranted/accepted when
RequestVoteCandidateWin an electionVoter hasn't voted this term AND candidate's log is ≥ as up-to-date.
AppendEntriesLeaderReplicate entries / heartbeat (empty)Term is current AND prev entry's index+term match (log-matching check).

Raft — Key Properties

  • Election safety — at most one leader per term.
  • Log matching — same index+term ⇒ logs identical up to that point.
  • Leader completeness — a committed entry exists in every future leader's log.
  • Committed = replicated to a majority by the current leader ⇒ durable, never lost.
  • Membership changes via joint consensus (old+new majorities both required) to avoid split-brain; snapshots compact the log and catch up laggards.

Paxos — Two Phases (Single-Decree)

PhaseProposer asksAcceptor does
1: Prepare / Promise"Promise to ignore ballots < N"Promises; returns any value already accepted.
2: Accept / Accepted"Accept value V at ballot N"Accepts if it hasn't promised a higher N. Majority accept ⇒ chosen.
Paxos safety rule: if any acceptor already accepted a value, a new proposer must re-propose that same value (the highest-ballot one it learns about in Phase 1) — this is what prevents two different values being chosen.
  • Roles: Proposer (drives), Acceptor (votes), Learner (finds out the result).
  • Multi-Paxos: elect a stable leader once, skip Phase 1 for later slots ⇒ steady-state commit = one round trip (just like Raft).

Raft vs Paxos — One-Liners

  • Same problem, same majority foundation; both reduce to agreeing on a log.
  • Raft = prescriptive, single strong leader, structured log, built for understandability.
  • Paxos = minimal, general, leaderless in theory; leaves leader election + multi-decision as DIY.
  • Multi-Paxos and Raft converge on "stable leader + replicated log" in practice.
  • Term (Raft) = Epoch = Ballot (Paxos) — the increasing number that exposes stale leaders.
  • Both are crash-fault tolerant only — neither handles Byzantine/malicious nodes (use PBFT for that).

Real-World Systems

SystemAlgorithmUsed for
etcdRaftKubernetes' config/state store.
ConsulRaftService discovery, config, locks.
CockroachDB / TiKVRaft (per shard)Distributed SQL/KV replication.
ZooKeeperZAB (Raft-like)Coordination, locks, leader election.
Google Spanner / Chubby(Multi-)PaxosGlobally-distributed DB; lock service.
PBFT / blockchainsBFT consensusAgreement among untrusted/malicious nodes.

Rules of Thumb

Use 3 nodes for "survive one failure," 5 for "survive two / maintenance + a failure." Beyond 7, write throughput suffers — shard into multiple groups instead.
Don't roll your own consensus. Use a proven Raft library (etcd/Hashicorp). Subtle bugs in election or commit logic cause silent data loss that tests rarely catch.
Every decision costs at least one majority round trip — budget that latency. Cross-datacenter clusters pay WAN latency on every write; keep the quorum geographically tight or accept the cost.
  • Partition ⇒ only the majority side makes progress; the minority pauses (consistency over availability).
  • Scale reads with follower reads + leader leases; scale writes by sharding into many consensus groups.
  • Cap the log with snapshots; bootstrap new nodes from a snapshot, not full replay.

Continue reading