Frequently Asked Questions

By Pritesh Yadav

Q: Raft or Paxos — which should I actually use?

A: For almost any new system, reach for a battle-tested Raft library (etcd's or HashiCorp's). Raft was explicitly designed to be understandable and operable, and the ecosystem of tooling around it is large. Plain Paxos is famously hard to implement correctly; the version people deploy is Multi-Paxos, which ends up structurally close to Raft anyway. Use Paxos only when you're adopting a system already built on it (e.g. Spanner) or you need a specific variant like EPaxos. Either way, don't write your own from scratch; use a proven implementation.

Q: Why a majority and not all nodes?

A: Requiring all nodes means a single failure or a single slow node stops all progress — you'd have zero fault tolerance. A majority lets the cluster keep deciding while a minority is down. Safety still holds because any two majorities must share at least one node, and that overlapping node prevents two conflicting decisions from both "winning."
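The overlap argument can be checked by brute force for a small cluster. This is only an illustrative sketch; the function names `majority` and `all_majorities_overlap` are made up for this example and do not come from any library.

```python
from itertools import combinations

def majority(n: int) -> int:
    """Smallest number of nodes that is more than half of n."""
    return n // 2 + 1

def all_majorities_overlap(nodes: list[str]) -> bool:
    """Return True if every pair of minimum-size majorities shares a node.

    Larger majorities are supersets of these, so they overlap as well.
    """
    q = majority(len(nodes))
    quorums = [set(c) for c in combinations(nodes, q)]
    return all(a & b for a, b in combinations(quorums, 2))

nodes = ["n1", "n2", "n3", "n4", "n5"]
print(majority(len(nodes)))           # 3
print(all_majorities_overlap(nodes))  # True
```

The check passes for any cluster size because two groups that are each larger than half the cluster cannot fit side by side without sharing a member.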

Q: What happens if two leaders exist at once?

A: It can briefly happen, e.g. when an old leader is partitioned away and doesn't yet know a new one was elected. It's safe, though, because terms (epochs) are monotonically increasing numbers and every message carries one. The new leader operates in a higher term; the moment the old leader contacts a node with the newer term, it learns it's stale and steps down. Crucially, the old leader can't commit anything in the meantime, because committing needs a majority, and the majority has already moved to the new term and rejects requests from the older one.
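The step-down rule is small enough to sketch. The `Node` class and its `observe_term` method below are hypothetical names for illustration, not the API of any real Raft library; the sketch assumes every message or reply carries the sender's term.

```python
from dataclasses import dataclass

@dataclass
class Node:
    name: str
    term: int = 0
    role: str = "follower"

    def observe_term(self, seen_term: int) -> None:
        """Adopt a newer term and step down if any message carries one."""
        if seen_term > self.term:
            self.term = seen_term
            self.role = "follower"

old_leader = Node("a", term=3, role="leader")
old_leader.observe_term(4)  # reply from a node that already moved to term 4
print(old_leader.role, old_leader.term)  # follower 4
```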

Q: Does consensus survive a network partition?

A: Safety always survives — you will never get split-brain conflicting commits. Availability does not always survive: only the side of the partition that still holds a majority can make progress; the minority side pauses writes until the partition heals. This is the CAP trade-off in action — these systems choose consistency over availability when forced to pick.
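As a rough illustration, the question of which side of a partition can keep committing reduces to one comparison against the full cluster's majority. The helper `sides_that_can_commit` is a hypothetical name used only for this sketch.

```python
def sides_that_can_commit(cluster_size: int, sides: list[int]) -> list[bool]:
    """For each partition side (given by its node count), report whether it
    still holds a majority of the whole cluster."""
    quorum = cluster_size // 2 + 1
    return [size >= quorum for size in sides]

print(sides_that_can_commit(5, [3, 2]))  # [True, False]
print(sides_that_can_commit(4, [2, 2]))  # [False, False]
```

At most one side can ever hold a majority, which is why conflicting commits are impossible; in an even split, no side does and writes pause everywhere.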

Q: How many nodes do I need, and why are 3 or 5 typical?

A: To tolerate f failures you need 2f+1 nodes. So 3 nodes tolerate 1 failure and 5 tolerate 2 — the sweet spots for most production clusters. 3 is the cheapest cluster that survives any single failure; 5 buys you more resilience and lets you lose a node for maintenance while still tolerating one unplanned failure.
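The 2f+1 arithmetic, written out as a minimal sketch (the function names are illustrative):

```python
def nodes_needed(f: int) -> int:
    """Cluster size required to tolerate f crashed nodes."""
    return 2 * f + 1

def failures_tolerated(n: int) -> int:
    """Number of crashed nodes an n-node cluster can survive."""
    return (n - 1) // 2

print(nodes_needed(1), nodes_needed(2))              # 3 5
print(failures_tolerated(3), failures_tolerated(5))  # 1 2
```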

Q: Why odd numbers of nodes?

A: Growing from an odd size to the next even size adds no fault tolerance but raises the majority threshold, so it can actually hurt. A 4-node cluster still tolerates only 1 failure (its majority is 3), the same as a 3-node cluster, yet it has more machines that can fail. It can also split 2-2 in a partition, leaving neither side with a majority. Odd sizes give the best resilience per node.
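To put a number on "more machines to fail", here is a back-of-the-envelope sketch. It assumes each node is down independently with the same probability p, which is a simplification (real failures are often correlated), and the value p = 0.01 is picked purely for illustration.

```python
from math import comb

def p_unavailable(n: int, p: float) -> float:
    """Probability that more nodes are down than an n-node cluster tolerates,
    assuming independent failures with probability p per node."""
    tolerated = (n - 1) // 2
    return sum(
        comb(n, k) * p**k * (1 - p) ** (n - k)
        for k in range(tolerated + 1, n + 1)
    )

for n in (3, 4, 5):
    print(n, p_unavailable(n, 0.01))
```

Under these assumptions the 4-node cluster is more likely to lose its majority than the 3-node one, while the 5-node cluster is far less likely than either.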

Q: Can't I just use a single database instead?

A: A single database is simpler and consistent, but it's a single point of failure — when it goes down, everything goes down, and a crash can lose recent writes. Consensus exists precisely to replicate that database across machines so it survives failures without serving conflicting data. If you genuinely don't need high availability or durability beyond one box, a single DB is fine — don't add consensus you don't need.

Q: How do Raft and Paxos relate — is Raft just Paxos?

A: They solve the same problem with the same majority-quorum foundation, and Multi-Paxos and Raft converge on a similar "stable leader + replicated log" shape. The differences are in framing: Raft prescribes a single strong leader, a structured log with the log-matching rule, and explicit membership changes, all chosen for clarity. Classic Paxos is more minimal and general but leaves leader election and the multi-decision machinery as exercises, which is why real Paxos deployments reinvent much of what Raft specifies.

Q: What is a replicated state machine and why does every section mention the log?

A: It's the key reduction. If every node runs the same deterministic state machine and applies the exact same commands in the exact same order, all nodes end up identical. So the only thing you need consensus on is the order of commands — i.e. the contents of an append-only log. Both Raft and Paxos are, at heart, machines for agreeing on a log.
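A toy example makes the reduction concrete. The key-value "state machine" and its two commands below are invented for this sketch; the only property that matters is that `apply` is deterministic.

```python
def apply(state: dict, command: tuple) -> dict:
    """Deterministically apply one command and return the new state."""
    op, key, value = command
    new_state = dict(state)
    if op == "set":
        new_state[key] = value
    elif op == "incr":
        new_state[key] = new_state.get(key, 0) + value
    return new_state

def replay(log: list) -> dict:
    """Build the state by applying every log entry in order."""
    state = {}
    for command in log:
        state = apply(state, command)
    return state

log = [("set", "x", 1), ("incr", "x", 5), ("set", "y", 2)]
replica_a = replay(log)
replica_b = replay(log)
print(replica_a == replica_b)  # True
print(replica_a)               # {'x': 6, 'y': 2}

# Same commands, different order: the replicas would diverge.
reordered = [log[1], log[0], log[2]]
print(replay(reordered))       # {'x': 1, 'y': 2}
```

The last line is the reason the order has to be agreed on: the same set of commands applied in a different order produces a different state.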

Q: If FLP proves consensus is impossible, how do real systems work?

A: FLP says no deterministic algorithm can guarantee termination in a fully asynchronous network where a node might crash — it's about a theoretical worst case where you can never tell "crashed" from "slow." Real systems dodge it by adding timeouts and randomness: they assume the network is usually timely. They never violate safety; they only risk a temporary stall in pathological conditions, which is an acceptable trade.

Q: When is an entry "committed," and can a committed entry ever be lost?

A: In Raft, an entry is committed once the leader that created it has replicated it to a majority of servers. (Entries left over from earlier terms are not committed by counting replicas; they become committed indirectly when a current-term entry after them commits.) After that it's permanent: the voting rules guarantee every future leader already has it, so it's never overwritten or lost. Uncommitted entries can be discarded if leadership changes; that's correct, because no client was ever told they succeeded.
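One way a leader might compute how far it can advance its commit point is sketched below. The function and parameter names are assumptions for illustration (loosely modeled on the `matchIndex` bookkeeping in the Raft paper), and log indices are 1-based.

```python
def commit_index(match_index: list[int], log_terms: list[int],
                 current_term: int, old_commit: int) -> int:
    """Return the new commit index for a leader.

    match_index: highest log index known to be stored on each server,
                 the leader itself included.
    log_terms:   log_terms[i - 1] is the term of the entry at index i.
    """
    quorum = len(match_index) // 2 + 1
    # The quorum-th largest value is stored on at least a majority of servers.
    candidate = sorted(match_index, reverse=True)[quorum - 1]
    # Only entries from the leader's own term are committed by counting replicas.
    if candidate > old_commit and log_terms[candidate - 1] == current_term:
        return candidate
    return old_commit

terms = [1, 1, 2, 2, 3, 3, 3]
print(commit_index([7, 7, 5, 4, 2], terms, current_term=3, old_commit=4))  # 5
print(commit_index([7, 7, 5, 4, 2], terms, current_term=4, old_commit=4))  # 4
```

In the second call the majority-replicated entry belongs to an older term, so the commit index stays put until an entry from the current term reaches a majority.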

Q: Why does the leader handle all the writes — isn't that a bottleneck?

A: A single leader makes reasoning about ordering trivial and keeps the protocol simple, which is the whole point of Raft. It is a throughput ceiling, yes. Systems scale around it by sharding (many Raft groups, each with its own leader), by batching and pipelining entries, and by serving some reads from followers with leases. Leaderless variants like EPaxos remove the single leader but pay in complexity.
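Sharding is the easiest of these to sketch: route each key to one of several independent Raft groups. The hash-based routing below is just one possible scheme and not how any particular system does it (many use range-based partitioning instead); `group_for_key` is a hypothetical helper.

```python
import hashlib

def group_for_key(key: str, num_groups: int) -> int:
    """Map a key to a Raft group id with a stable hash.

    hashlib is used rather than the built-in hash(), which is randomized
    per process for strings and would route inconsistently across nodes.
    """
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_groups

for key in ("user:1", "user:2", "order:17"):
    print(key, "-> group", group_for_key(key, 4))
```

Each group elects its own leader, so write load spreads across machines while every individual key still has a single ordering authority.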

Q: What's the difference between safety and liveness, and which wins?

A: Safety = "nothing bad ever happens" (no two conflicting decisions). Liveness = "something good eventually happens" (a decision is reached). Consensus protocols always prioritize safety: if forced to choose, they'd rather stop making progress than risk a wrong agreement. That's why a minority partition halts instead of inventing its own answer.

Q: Do Raft and Paxos protect against malicious or buggy nodes?

A: No. They assume the crash-fault (non-Byzantine) model: nodes may stop, slow down, or restart, but they never lie or send conflicting messages on purpose. Defending against malicious participants needs Byzantine Fault Tolerant protocols like PBFT (and the consensus mechanisms behind many blockchains), which require more nodes (3f+1) and more communication.

Q: What's a heartbeat and what is an election timeout?

A: The heartbeat is the leader's periodic "I'm alive" message (an empty AppendEntries) that stops followers from starting elections. The election timeout is how long a follower waits with no heartbeat before deciding the leader is gone and becoming a candidate. The timeout is randomized per node so that when a leader dies, one follower usually times out first and wins cleanly, avoiding split votes.
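A minimal sketch of the randomization follows. The 150 to 300 ms range and the 50 ms heartbeat interval are illustrative values chosen for this example, not settings taken from any specific implementation; the one constraint that matters is that heartbeats arrive well inside the shortest possible timeout.

```python
import random

HEARTBEAT_INTERVAL_MS = 50              # assumed value for illustration
ELECTION_TIMEOUT_RANGE_MS = (150, 300)  # assumed range for illustration

def pick_election_timeout(rng: random.Random) -> float:
    """Each follower draws its own timeout from the range."""
    low, high = ELECTION_TIMEOUT_RANGE_MS
    return rng.uniform(low, high)

def first_to_time_out(followers: list[str], rng: random.Random):
    """Return the follower whose timer fires first, plus all drawn timeouts."""
    timeouts = {name: pick_election_timeout(rng) for name in followers}
    return min(timeouts, key=timeouts.get), timeouts

rng = random.Random(42)
winner, timeouts = first_to_time_out(["b", "c", "d", "e"], rng)
print(winner, timeouts)
```

Because the draws rarely coincide, one follower usually becomes a candidate and collects votes before the others time out.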

Q: How does the cluster change membership without breaking safety?

A: Naively swapping the member list is dangerous: servers switch at different moments, so for a while a majority of the old set and a majority of the new set could each elect a leader (split-brain). Raft avoids this with joint consensus: a transitional configuration where decisions need a majority from both the old and the new sets simultaneously, so there's never a window where two disjoint majorities can decide independently. Once the joint configuration is committed, the leader replicates the new configuration and the cluster switches over to it.
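The joint rule can be sketched as a quorum check over two member sets. The helper names and the example memberships below are made up for illustration.

```python
def has_majority(votes: set[str], members: set[str]) -> bool:
    """True if the voters include more than half of the given member set."""
    return len(votes & members) > len(members) // 2

def joint_quorum(votes: set[str], old: set[str], new: set[str]) -> bool:
    """During the transition, a decision needs a majority of BOTH sets."""
    return has_majority(votes, old) and has_majority(votes, new)

old = {"a", "b", "c"}
new = {"c", "d", "e"}

print(joint_quorum({"a", "b"}, old, new))       # False: majority of old only
print(joint_quorum({"d", "e"}, old, new))       # False: majority of new only
print(joint_quorum({"a", "c", "d"}, old, new))  # True
```

The first two vote sets are disjoint, and each is a majority of one configuration. Under a naive switch both could decide at once; under the joint rule neither can.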

Q: Why do these systems use snapshots?

A: The log grows forever otherwise. A snapshot captures the state machine's current state so all the log entries that produced it can be discarded (log compaction), keeping disk usage and replay times bounded. Snapshots also let a brand-new or far-behind server catch up quickly by installing the snapshot instead of replaying the whole history.
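A toy sketch of compaction and snapshot installation is shown below. The `Replica` class is hypothetical, and it assumes every entry in the log has already been committed and applied; a real implementation would only snapshot up to its last applied index.

```python
from dataclasses import dataclass, field

@dataclass
class Replica:
    state: dict = field(default_factory=dict)
    snapshot_state: dict = field(default_factory=dict)
    snapshot_index: int = 0                  # last log index covered by the snapshot
    log: list = field(default_factory=list)  # entries after snapshot_index

    def append_and_apply(self, key: str, value: int) -> None:
        self.log.append((key, value))
        self.state[key] = value

    def compact(self) -> None:
        """Snapshot the current state and drop the entries it covers."""
        self.snapshot_index += len(self.log)
        self.snapshot_state = dict(self.state)
        self.log.clear()

    def install_snapshot(self, snapshot_state: dict, snapshot_index: int) -> None:
        """Catch up from a snapshot instead of replaying history."""
        self.state = dict(snapshot_state)
        self.snapshot_state = dict(snapshot_state)
        self.snapshot_index = snapshot_index
        self.log.clear()

leader = Replica()
for i in range(1000):
    leader.append_and_apply("counter", i)
leader.compact()
print(leader.snapshot_index, len(leader.log))  # 1000 0

newcomer = Replica()
newcomer.install_snapshot(leader.snapshot_state, leader.snapshot_index)
print(newcomer.state == leader.state)  # True
```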
