The Consensus Problem
This whole guide is about one deceptively small word: agree. You already know that in a distributed system there are many machines (we call them nodes), the network can lose, delay, duplicate, or reorder messages, and any node can crash at any moment without warning. Now we ask a brutal question: given all that chaos, how do we get a group of machines to agree on a single value and never go back on it? That question is the consensus problem, and solving it correctly is the hardest and most important thing in distributed systems. Raft and Paxos — the algorithms this guide teaches — exist for exactly one reason: to solve consensus. So before we touch either of them, we need to understand the problem they are solving, why it is so hard, and what "correct" even means.
1.1 What "consensus" actually means
Consensus is the act of getting a group of nodes to all decide on the same single value, even though some of those nodes may crash and some messages may be lost or delayed. The value can be anything: a number, the name of a leader, the next operation to run, "yes commit this transaction", or "node 7 holds the lock now". The key word is single — everyone who decides must decide on the same thing, and once a node has decided, it can never change its mind.
Let's be precise about the setup, because the precision is what makes it hard:
- Several nodes each start with a proposed value (an input they would like the group to choose). For a leader election, each node might propose "I should be leader" or "node 3 should be leader".
- The nodes exchange messages over an unreliable network.
- At the end, each non-crashed node outputs a decided value.
- The protocol is correct only if all those decided values are identical, the decided value was one that somebody actually proposed, and the nodes eventually reach a decision rather than arguing forever.
If that sounds easy, remember: you cannot tell a crashed node from a slow one (no global clock, unreliable network), messages can arrive out of order or never arrive, and any node — including the one currently "in charge" — can die mid-sentence. You have to be correct anyway.
1.2 Why we need consensus — concrete uses
Consensus is not an academic toy. Almost every reliable distributed system has a consensus algorithm beating at its core. Here are the jobs it does:
- Electing one leader. Many systems want exactly one node "in charge" at a time (to avoid two nodes giving conflicting orders). The cluster must agree on who the leader is. If two nodes each believe they are the sole leader — a split brain — you get data corruption. Consensus prevents that.
- Agreeing on the order of operations. If five replicas of a database each apply writes in a different order, they end up with different data. Consensus lets the cluster agree on one ordered log of operations, so every replica applies the same operations in the same order and ends up identical. This is the heart of replicated state machines, which we cover later.
- Committing a transaction. "Did we all agree to commit, or all agree to abort?" Every participant must reach the same verdict — you cannot have one node commit while another aborts.
- Agreeing who holds a lock. A distributed lock is just consensus on the question "which client currently owns this lock?" If two clients think they hold the same lock, they can both edit the same resource and corrupt it.
- Agreeing on cluster membership and configuration. Which nodes are part of the cluster right now? What is the current config? When you add or remove a server, every node must agree on the new membership at the same logical moment, or messages get sent to ghosts and quorums get miscounted.
etcd (the database behind Kubernetes) stores critical cluster state — which pods exist, what the desired configuration is. It runs the Raft consensus algorithm internally. When you run kubectl apply, the change is proposed to the Raft cluster; a majority of etcd nodes must agree to append it to the shared log before it is considered committed. That is why Kubernetes can survive a node dying without losing or corrupting its cluster state — consensus guarantees every surviving etcd node has the exact same data in the exact same order.
1.3 The formal requirements — in plain words
A protocol "solves consensus" only if it satisfies three properties. These are the rules of the game. Memorize them; everything Raft and Paxos do is in service of these three.
| Property | Plain-English meaning | What breaks if you violate it |
|---|---|---|
| Agreement | No two non-faulty nodes decide on different values. If I decided "X", nobody else ever decides "Y". | Split brain, divergent replicas, two leaders, corrupted data. |
| Validity (also called Integrity) | The value that gets decided must be one that some node actually proposed. The system cannot invent a value out of thin air, and a node decides at most once. | The cluster "agrees" on a value nobody asked for — agreement on garbage is useless. |
| Termination (also called Liveness) | Every non-faulty node eventually decides something. The protocol does not get stuck forever. | The cluster argues forever and never makes progress; the system hangs. |
Notice the tension. Agreement and Validity are about never being wrong. Termination is about eventually making progress. A protocol that does nothing at all is perfectly safe (it never decides anything wrong) but useless (it never decides anything at all). A protocol that decides instantly always terminates but might let two nodes pick different values. The art of consensus is satisfying all three at once, and as we will see, in the worst case you literally cannot guarantee all three.
1.4 Safety vs liveness — the two halves of "correct"
Distributed-systems people split "correctness" into two categories, and you will hear these words constantly:
- Safety = "never do a wrong thing." A safety property says some bad thing never happens. Agreement and Validity are safety properties: we never let two nodes decide differently, and we never invent a value. If a safety property is violated even once, the damage is permanent — you cannot un-corrupt the data.
- Liveness = "eventually do a good thing." A liveness property says some good thing eventually happens. Termination is a liveness property: eventually a decision gets made. Violating liveness means the system is slow or stuck, which is bad — but it is recoverable. When the network heals, it can make progress again.
The crucial engineering insight: good consensus algorithms never sacrifice safety, only liveness. When the network is broken or too many nodes are down, Raft and Paxos will refuse to make progress (they pause, breaking liveness temporarily) rather than risk deciding two different values (breaking safety forever). A stalled cluster is annoying; a corrupted cluster is a disaster. They always pick "annoying".
1.5 The FLP impossibility result — why this is provably hard
Here is the bombshell that shaped the entire field. In 1985, three researchers — Michael Fischer, Nancy Lynch, and Mike Paterson — proved a theorem now known as FLP impossibility. We will only build the intuition, not the formal proof, but the intuition is what matters.
First, two definitions:
- A fully asynchronous system is one where there is no upper bound on how long a message can take to arrive and no upper bound on how long a node can take to do a step. Messages always arrive eventually, but "eventually" could be one millisecond or one million years — you are never allowed to assume a limit.
- A deterministic algorithm is one with no randomness: given the same inputs and messages, it always does the same thing.
The FLP theorem states:
In a fully asynchronous system, there is no deterministic algorithm that solves consensus and is guaranteed to terminate, if even a single node may crash.
Read that again. Even with just one possible crash, and even though messages are never actually lost (just arbitrarily delayed), you cannot build a deterministic protocol that always satisfies all three properties. Some execution will always exist where the protocol runs forever without deciding.
The intuition for why. The root cause is the thing you already know from the foundations guide: in an asynchronous system you cannot tell a crashed node from a slow node. When you are waiting for a message from node 3, and it has not arrived, you face an impossible choice:
- If you wait for node 3, and node 3 has actually crashed, you wait forever — that violates Termination.
- If you stop waiting and decide without node 3, and node 3 was merely slow (not crashed) and was about to send a message that would have changed the outcome — you might decide the wrong value, violating Agreement.
The proof formalizes this with a clever argument: it shows there is always a configuration where the decision could still go either way (the experts call this a bivalent state — "two-valued", meaning both outcomes are still possible). An adversary who controls only the timing of messages (which it is allowed to do, since timing is unbounded) can always delay exactly the one message that would tip the system toward a decision, keeping it perpetually undecided. The system never crashes, never loses a message — it just stays forever on the fence.
The impossible dilemma at the heart of FLP
------------------------------------------
You are node A, waiting to hear from node 3 before deciding.
? no message yet from node 3 ?
/ \
/ \
WAIT for it / \ DECIDE without it
/ \
node 3 was node 3 was only SLOW
actually CRASHED (its msg arrives later
-> you wait and would have changed
FOREVER the answer)
-> Termination -> you decided WRONG
broken -> Agreement broken
You cannot tell which branch you are in.
No global clock + unbounded delays = no safe choice.
1.6 How real systems escape FLP
FLP describes a worst case that real networks almost never hit. Real systems sidestep it by adding a pinch of assumption that the pure-asynchronous model forbids. There are three classic escape hatches:
| Escape hatch | What it adds | How it dodges FLP | Used by |
|---|---|---|---|
| Timeouts / partial synchrony | Assume that after some unknown point, the network behaves "well enough" — messages arrive within some bound. Use a clock-based timeout to suspect a node has crashed. | A timeout lets you stop waiting for a (probably) dead node and make progress. You might occasionally be wrong (a slow node looks dead), but the algorithm stays safe — it just retries. This restores liveness during good periods. | Raft, Multi-Paxos, ZAB, etcd, Consul |
| Randomization | Let nodes flip coins so the algorithm is not deterministic. | FLP only forbids deterministic protocols. With randomness, you can guarantee termination with probability 1 (it terminates "almost surely"), dodging the theorem's deterministic assumption. | Randomized Byzantine protocols; Raft uses randomized election timeouts to break ties. |
| Failure detectors | Add an oracle that gives hints about which nodes have crashed (it may be wrong sometimes). | Even an imperfect failure detector — one that's eventually accurate — provides just enough extra information to make consensus solvable. | The theory underpinning all of the above. |
The most important one in practice is partial synchrony via timeouts. Raft is built entirely on it: a node waits for a heartbeat from the leader, and if no heartbeat arrives before its election timeout fires, it assumes the leader is dead and starts a new election. The timeout is the practical answer to "is it crashed or just slow?" — after waiting long enough, treat it as crashed and move on. If you were wrong, no harm to safety; the system just sorts it out.
1.7 Quorums and majorities — why "more than half" is magic
Now the single most important mechanical idea in all of consensus. You already met quorums at a basic level; here is why they make consensus correct.
A quorum is the minimum number of nodes that must agree before a decision counts. The quorum that consensus algorithms use is a majority: strictly more than half of all nodes. In a cluster of N nodes, a majority is floor(N/2) + 1 nodes.
Cluster size N | Majority needed | Failures tolerated (N − majority) |
|---|---|---|
| 3 | 2 | 1 |
| 5 | 3 | 2 |
| 7 | 4 | 3 |
Why a majority specifically, and not, say, "any 2 nodes"? Because of one beautiful, simple fact:
Any two majorities of the same set must share at least one common member.
This is just counting. If each of two groups contains more than half the nodes, then together they contain more than all the nodes — which is impossible unless they overlap. So they must have at least one node in common. That one shared node is the secret to safety.
Cluster of 5 nodes: [ n1 n2 n3 n4 n5 ]
A majority = 3 nodes.
Majority A (votes for value X): n1 n2 n3
Majority B (votes for value Y): n3 n4 n5
^^
n3 is in BOTH groups.
n3 can only have voted for ONE value.
So A and B cannot both succeed with different values.
=> conflicting decisions are mathematically impossible.
Two groups of "more than half" of 5 -> 3 + 3 = 6 > 5
-> the pigeonhole forces an overlap.
Here is why that overlap guarantees Agreement. Suppose value X was decided because a majority of nodes accepted it. Later, some other process tries to get value Y decided — it too needs a majority to accept Y. But the X-majority and the Y-majority must overlap in at least one node. That overlapping node already accepted X. The protocol's rules forbid it from also accepting a conflicting Y (each node accepts at most one conflicting value, or carries forward the value it already saw). So Y can never gather a clean majority for a different value. The first decision is locked in. The overlap is the lock.
This same overlap is also what lets a new leader safely learn what the old leader committed. When a new leader is elected by a majority, that electing majority is guaranteed to share at least one node with any majority that committed an entry — so the new leader is guaranteed to see any already-committed data and never erase it. (Raft's "Leader Completeness" property, which the Raft paper proves, rests on exactly this overlap.)
Common mistakes & misconceptions
Best practices
Section summary
- Consensus is getting a group of nodes to agree on a single value and never reverse it, despite crashes, lost messages, and delays — it is the foundational hard problem of distributed systems.
- We need it for leader election, ordering operations, committing transactions, distributed locks, and cluster membership — anywhere "everyone must agree exactly once" matters.
- A correct consensus protocol must satisfy Agreement (no two decide differently), Validity/Integrity (the value was actually proposed), and Termination/Liveness (a decision is eventually reached).
- Safety = "never do a wrong thing" (permanent if violated); liveness = "eventually do a good thing" (recoverable). Good algorithms sacrifice liveness, never safety.
- The FLP impossibility result proves that in a fully asynchronous, deterministic system with even one possible crash, you cannot guarantee consensus always terminates — because you can't distinguish a crashed node from a slow one.
- Real systems escape FLP with timeouts (partial synchrony), randomization, and failure detectors — keeping safety absolute and buying back liveness during good network periods.
- Consensus uses majority quorums (more than half) because any two majorities must overlap in at least one node, and that shared node cannot endorse two conflicting values.
- The overlap property is the mathematical lock that guarantees Agreement and lets new leaders safely inherit committed data — it is the core idea beneath both Raft and Paxos.
- Use odd-sized, small clusters; at most one side of any network partition can hold a majority, which is precisely what prevents split-brain.