The Consensus Problem

By Pritesh Yadav 20 min read

This whole guide is about one deceptively small word: agree. You already know that in a distributed system there are many machines (we call them nodes), the network can lose, delay, duplicate, or reorder messages, and any node can crash at any moment without warning. Now we ask a brutal question: given all that chaos, how do we get a group of machines to agree on a single value and never go back on it? That question is the consensus problem, and solving it correctly is the hardest and most important thing in distributed systems. Raft and Paxos — the algorithms this guide teaches — exist for exactly one reason: to solve consensus. So before we touch either of them, we need to understand the problem they are solving, why it is so hard, and what "correct" even means.

1.1 What "consensus" actually means

Consensus is the act of getting a group of nodes to all decide on the same single value, even though some of those nodes may crash and some messages may be lost or delayed. The value can be anything: a number, the name of a leader, the next operation to run, "yes commit this transaction", or "node 7 holds the lock now". The key word is single — everyone who decides must decide on the same thing, and once a node has decided, it can never change its mind.

Let's be precise about the setup, because the precision is what makes it hard:

  • Several nodes each start with a proposed value (an input they would like the group to choose). For a leader election, each node might propose "I should be leader" or "node 3 should be leader".
  • The nodes exchange messages over an unreliable network.
  • At the end, each non-crashed node outputs a decided value.
  • The protocol is correct only if all those decided values are identical, the decided value was one that somebody actually proposed, and the nodes eventually reach a decision rather than arguing forever.

If that sounds easy, remember: you cannot tell a crashed node from a slow one (no global clock, unreliable network), messages can arrive out of order or never arrive, and any node — including the one currently "in charge" — can die mid-sentence. You have to be correct anyway.

Analogy: Imagine a group of friends planning where to eat dinner, but they can only communicate by passing paper notes through an unreliable courier. Some notes get lost. Some friends leave the room without telling anyone (crash). Some friends are just slow to reply — and from the outside you cannot tell "left the room" apart from "still thinking". The challenge is to make sure that when the night ends, everyone who is still around went to the exact same restaurant, that the restaurant was one somebody actually suggested, and that they didn't stand around forever unable to decide. That is consensus.

1.2 Why we need consensus — concrete uses

Consensus is not an academic toy. Almost every reliable distributed system has a consensus algorithm beating at its core. Here are the jobs it does:

  • Electing one leader. Many systems want exactly one node "in charge" at a time (to avoid two nodes giving conflicting orders). The cluster must agree on who the leader is. If two nodes each believe they are the sole leader — a split brain — you get data corruption. Consensus prevents that.
  • Agreeing on the order of operations. If five replicas of a database each apply writes in a different order, they end up with different data. Consensus lets the cluster agree on one ordered log of operations, so every replica applies the same operations in the same order and ends up identical. This is the heart of replicated state machines, which we cover later.
  • Committing a transaction. "Did we all agree to commit, or all agree to abort?" Every participant must reach the same verdict — you cannot have one node commit while another aborts.
  • Agreeing who holds a lock. A distributed lock is just consensus on the question "which client currently owns this lock?" If two clients think they hold the same lock, they can both edit the same resource and corrupt it.
  • Agreeing on cluster membership and configuration. Which nodes are part of the cluster right now? What is the current config? When you add or remove a server, every node must agree on the new membership at the same logical moment, or messages get sent to ghosts and quorums get miscounted.
Example: etcd (the database behind Kubernetes) stores critical cluster state — which pods exist, what the desired configuration is. It runs the Raft consensus algorithm internally. When you run kubectl apply, the change is proposed to the Raft cluster; a majority of etcd nodes must agree to append it to the shared log before it is considered committed. That is why Kubernetes can survive a node dying without losing or corrupting its cluster state — consensus guarantees every surviving etcd node has the exact same data in the exact same order.

1.3 The formal requirements — in plain words

A protocol "solves consensus" only if it satisfies three properties. These are the rules of the game. Memorize them; everything Raft and Paxos do is in service of these three.

PropertyPlain-English meaningWhat breaks if you violate it
Agreement No two non-faulty nodes decide on different values. If I decided "X", nobody else ever decides "Y". Split brain, divergent replicas, two leaders, corrupted data.
Validity (also called Integrity) The value that gets decided must be one that some node actually proposed. The system cannot invent a value out of thin air, and a node decides at most once. The cluster "agrees" on a value nobody asked for — agreement on garbage is useless.
Termination (also called Liveness) Every non-faulty node eventually decides something. The protocol does not get stuck forever. The cluster argues forever and never makes progress; the system hangs.

Notice the tension. Agreement and Validity are about never being wrong. Termination is about eventually making progress. A protocol that does nothing at all is perfectly safe (it never decides anything wrong) but useless (it never decides anything at all). A protocol that decides instantly always terminates but might let two nodes pick different values. The art of consensus is satisfying all three at once, and as we will see, in the worst case you literally cannot guarantee all three.

Common mistake: Thinking Agreement just means "the nodes vote and majority wins". Agreement is much stronger: once any node has decided, that decision is final and no other node may ever decide differently — not even after a crash, a network partition, a restart, or a leader change. The decision must survive failures. That permanence is what makes it hard.

1.4 Safety vs liveness — the two halves of "correct"

Distributed-systems people split "correctness" into two categories, and you will hear these words constantly:

  • Safety = "never do a wrong thing." A safety property says some bad thing never happens. Agreement and Validity are safety properties: we never let two nodes decide differently, and we never invent a value. If a safety property is violated even once, the damage is permanent — you cannot un-corrupt the data.
  • Liveness = "eventually do a good thing." A liveness property says some good thing eventually happens. Termination is a liveness property: eventually a decision gets made. Violating liveness means the system is slow or stuck, which is bad — but it is recoverable. When the network heals, it can make progress again.

The crucial engineering insight: good consensus algorithms never sacrifice safety, only liveness. When the network is broken or too many nodes are down, Raft and Paxos will refuse to make progress (they pause, breaking liveness temporarily) rather than risk deciding two different values (breaking safety forever). A stalled cluster is annoying; a corrupted cluster is a disaster. They always pick "annoying".

Analogy: Safety is "the bank never lets your balance go negative or double-spends your money." Liveness is "the ATM eventually gives you your cash." If the network to the bank is down, a well-designed ATM says "sorry, try later" (liveness fails, temporarily) rather than handing out money it can't verify (safety fails, permanently). You would much rather wait than have your account corrupted. Consensus algorithms make the same choice.

1.5 The FLP impossibility result — why this is provably hard

Here is the bombshell that shaped the entire field. In 1985, three researchers — Michael Fischer, Nancy Lynch, and Mike Paterson — proved a theorem now known as FLP impossibility. We will only build the intuition, not the formal proof, but the intuition is what matters.

First, two definitions:

  • A fully asynchronous system is one where there is no upper bound on how long a message can take to arrive and no upper bound on how long a node can take to do a step. Messages always arrive eventually, but "eventually" could be one millisecond or one million years — you are never allowed to assume a limit.
  • A deterministic algorithm is one with no randomness: given the same inputs and messages, it always does the same thing.

The FLP theorem states:

In a fully asynchronous system, there is no deterministic algorithm that solves consensus and is guaranteed to terminate, if even a single node may crash.

Read that again. Even with just one possible crash, and even though messages are never actually lost (just arbitrarily delayed), you cannot build a deterministic protocol that always satisfies all three properties. Some execution will always exist where the protocol runs forever without deciding.

The intuition for why. The root cause is the thing you already know from the foundations guide: in an asynchronous system you cannot tell a crashed node from a slow node. When you are waiting for a message from node 3, and it has not arrived, you face an impossible choice:

  • If you wait for node 3, and node 3 has actually crashed, you wait forever — that violates Termination.
  • If you stop waiting and decide without node 3, and node 3 was merely slow (not crashed) and was about to send a message that would have changed the outcome — you might decide the wrong value, violating Agreement.

The proof formalizes this with a clever argument: it shows there is always a configuration where the decision could still go either way (the experts call this a bivalent state — "two-valued", meaning both outcomes are still possible). An adversary who controls only the timing of messages (which it is allowed to do, since timing is unbounded) can always delay exactly the one message that would tip the system toward a decision, keeping it perpetually undecided. The system never crashes, never loses a message — it just stays forever on the fence.

   The impossible dilemma at the heart of FLP
   ------------------------------------------

   You are node A, waiting to hear from node 3 before deciding.

                  ?  no message yet from node 3  ?
                 / \
                /   \
   WAIT for it /     \ DECIDE without it
              /       \
   node 3 was         node 3 was only SLOW
   actually CRASHED   (its msg arrives later
   -> you wait        and would have changed
      FOREVER         the answer)
   -> Termination     -> you decided WRONG
      broken          -> Agreement broken

   You cannot tell which branch you are in.
   No global clock + unbounded delays = no safe choice.
Common mistake: Reading FLP as "consensus is impossible, so Raft and Paxos are lying." Not true. FLP says you cannot guarantee termination in every possible execution of a fully asynchronous, deterministic system. It does not say you can't be correct. Raft and Paxos are always safe (Agreement and Validity hold in every execution, no exceptions). They only give up the impossible guarantee of termination — and in practice they terminate essentially always. FLP constrains the guarantee, not the usefulness.

1.6 How real systems escape FLP

FLP describes a worst case that real networks almost never hit. Real systems sidestep it by adding a pinch of assumption that the pure-asynchronous model forbids. There are three classic escape hatches:

Escape hatchWhat it addsHow it dodges FLPUsed by
Timeouts / partial synchrony Assume that after some unknown point, the network behaves "well enough" — messages arrive within some bound. Use a clock-based timeout to suspect a node has crashed. A timeout lets you stop waiting for a (probably) dead node and make progress. You might occasionally be wrong (a slow node looks dead), but the algorithm stays safe — it just retries. This restores liveness during good periods. Raft, Multi-Paxos, ZAB, etcd, Consul
Randomization Let nodes flip coins so the algorithm is not deterministic. FLP only forbids deterministic protocols. With randomness, you can guarantee termination with probability 1 (it terminates "almost surely"), dodging the theorem's deterministic assumption. Randomized Byzantine protocols; Raft uses randomized election timeouts to break ties.
Failure detectors Add an oracle that gives hints about which nodes have crashed (it may be wrong sometimes). Even an imperfect failure detector — one that's eventually accurate — provides just enough extra information to make consensus solvable. The theory underpinning all of the above.

The most important one in practice is partial synchrony via timeouts. Raft is built entirely on it: a node waits for a heartbeat from the leader, and if no heartbeat arrives before its election timeout fires, it assumes the leader is dead and starts a new election. The timeout is the practical answer to "is it crashed or just slow?" — after waiting long enough, treat it as crashed and move on. If you were wrong, no harm to safety; the system just sorts it out.

Key takeaway: FLP is not a wall that stops us; it is a signpost telling us where the danger is. It says: "You can't have guaranteed termination and guaranteed safety in a purely asynchronous world." Real systems respond: "Fine — we keep safety absolute, and we buy back liveness with timeouts and randomness, accepting that in a truly pathological network we might pause." Every consensus algorithm you will learn is a different, careful way of making that trade.

1.7 Quorums and majorities — why "more than half" is magic

Now the single most important mechanical idea in all of consensus. You already met quorums at a basic level; here is why they make consensus correct.

A quorum is the minimum number of nodes that must agree before a decision counts. The quorum that consensus algorithms use is a majority: strictly more than half of all nodes. In a cluster of N nodes, a majority is floor(N/2) + 1 nodes.

Cluster size NMajority neededFailures tolerated (N − majority)
321
532
743

Why a majority specifically, and not, say, "any 2 nodes"? Because of one beautiful, simple fact:

Any two majorities of the same set must share at least one common member.

This is just counting. If each of two groups contains more than half the nodes, then together they contain more than all the nodes — which is impossible unless they overlap. So they must have at least one node in common. That one shared node is the secret to safety.

  Cluster of 5 nodes:   [ n1  n2  n3  n4  n5 ]
  A majority = 3 nodes.

  Majority A (votes for value X):   n1  n2  n3
  Majority B (votes for value Y):           n3  n4  n5
                                            ^^
                              n3 is in BOTH groups.

  n3 can only have voted for ONE value.
  So A and B cannot both succeed with different values.
  => conflicting decisions are mathematically impossible.

  Two groups of "more than half" of 5  ->  3 + 3 = 6 > 5
  -> the pigeonhole forces an overlap.

Here is why that overlap guarantees Agreement. Suppose value X was decided because a majority of nodes accepted it. Later, some other process tries to get value Y decided — it too needs a majority to accept Y. But the X-majority and the Y-majority must overlap in at least one node. That overlapping node already accepted X. The protocol's rules forbid it from also accepting a conflicting Y (each node accepts at most one conflicting value, or carries forward the value it already saw). So Y can never gather a clean majority for a different value. The first decision is locked in. The overlap is the lock.

This same overlap is also what lets a new leader safely learn what the old leader committed. When a new leader is elected by a majority, that electing majority is guaranteed to share at least one node with any majority that committed an entry — so the new leader is guaranteed to see any already-committed data and never erase it. (Raft's "Leader Completeness" property, which the Raft paper proves, rests on exactly this overlap.)

Analogy: Picture a club with 5 members deciding a rule by show of hands, but votes happen at different times in different rooms and people come and go. You declare a rule "passed" only if at least 3 of the 5 raised their hand for it. Now suppose a sneaky member later tries to pass a contradictory rule, also needing 3 hands. Because 3 + 3 = 6 is more than the 5 total members, at least one person would have to have raised their hand for both contradictory rules — and that person remembers they already committed to the first one and refuses. The majority rule makes it physically impossible for two contradictory rules to both pass. That is exactly why consensus uses majorities.
Common mistake: Using an even number of nodes thinking "more nodes = more reliable". A 4-node cluster needs a majority of 3 and tolerates only 1 failure — exactly the same fault tolerance as a 3-node cluster, but with more machines to fail and slower writes. Worse, even sizes make ties more likely. Consensus clusters are almost always odd (3, 5, 7) for the best failures-tolerated-per-node ratio.
Common mistake: Believing a network partition can let "both sides keep working". With majority quorums, at most one side of any partition can contain a majority. If you split a 5-node cluster into a group of 3 and a group of 2, only the group of 3 can make decisions; the group of 2 cannot reach quorum and must stop accepting writes. This is the algorithm choosing safety over availability on the minority side — and it is exactly what prevents split-brain.
Key takeaway: Majority quorums turn the abstract goal "never let two nodes disagree" into a concrete, countable rule. Because any two majorities share a member, and that shared member can't endorse two conflicting values, conflicting decisions are mathematically impossible. This single idea — overlap — is the load-bearing wall under both Raft and Paxos.

Common mistakes & misconceptions

Common mistake: Thinking consensus means "the nodes take a vote and majority wins, once." Real consensus must keep Agreement holding forever, across crashes, restarts, partitions, and leader changes — the decision can never be reversed or contradicted later.
Common mistake: Reading FLP impossibility as "consensus can't be done, so Raft/Paxos are broken." FLP forbids only a guarantee of termination in a fully asynchronous, deterministic model. Raft and Paxos are always safe and terminate in practice; they only forfeit the impossible liveness guarantee, and only during pathological network conditions.
Common mistake: Confusing safety with liveness. "It's slow / stuck" is a liveness problem (recoverable). "Two nodes decided different values" is a safety problem (permanent corruption). Good algorithms sacrifice liveness to protect safety — never the reverse.
Common mistake: Assuming more nodes always means more fault tolerance. A 4-node cluster tolerates the same single failure as a 3-node cluster but is slower and tie-prone. Fault tolerance jumps only at odd sizes (3→1, 5→2, 7→3).
Common mistake: Thinking a node can tell "crashed" from "slow". In an asynchronous network it fundamentally cannot — that indistinguishability is the entire source of difficulty, and timeouts are only a guess, not a fact.

Best practices

Best practice: Run consensus clusters with an odd number of nodes (3 or 5 for most workloads). It maximizes failures-tolerated per machine and avoids tie scenarios.
Best practice: When you design or evaluate a consensus-based system, check the two halves separately: "Is it always safe (never two conflicting decisions)?" and "Is it live during normal operation?" Demand absolute safety; accept that liveness can pause during severe network failures.
Best practice: Do not invent your own consensus protocol. Reuse a proven implementation (etcd/Raft, ZooKeeper/ZAB, Consul) — the subtle overlap and commit rules are exactly where hand-rolled protocols silently violate Agreement.
Best practice: Keep a consensus cluster small and well-connected. Every committed decision needs a round-trip to a majority, so latency and write throughput degrade as you add nodes; scale read capacity with replicas/followers, not by enlarging the voting set.
Best practice: Tune timeouts to your real network's behavior. Too short and healthy-but-slow nodes get falsely suspected (needless leader churn); too long and genuine failures take ages to recover from. The timeout is your practical stand-in for the unsolvable "crashed vs slow" question.

Section summary

  • Consensus is getting a group of nodes to agree on a single value and never reverse it, despite crashes, lost messages, and delays — it is the foundational hard problem of distributed systems.
  • We need it for leader election, ordering operations, committing transactions, distributed locks, and cluster membership — anywhere "everyone must agree exactly once" matters.
  • A correct consensus protocol must satisfy Agreement (no two decide differently), Validity/Integrity (the value was actually proposed), and Termination/Liveness (a decision is eventually reached).
  • Safety = "never do a wrong thing" (permanent if violated); liveness = "eventually do a good thing" (recoverable). Good algorithms sacrifice liveness, never safety.
  • The FLP impossibility result proves that in a fully asynchronous, deterministic system with even one possible crash, you cannot guarantee consensus always terminates — because you can't distinguish a crashed node from a slow one.
  • Real systems escape FLP with timeouts (partial synchrony), randomization, and failure detectors — keeping safety absolute and buying back liveness during good network periods.
  • Consensus uses majority quorums (more than half) because any two majorities must overlap in at least one node, and that shared node cannot endorse two conflicting values.
  • The overlap property is the mathematical lock that guarantees Agreement and lets new leaders safely inherit committed data — it is the core idea beneath both Raft and Paxos.
  • Use odd-sized, small clusters; at most one side of any network partition can hold a majority, which is precisely what prevents split-brain.

Continue reading