Vector Clocks & Causality

By Pritesh Yadav 19 min read

In the previous section you met Lamport clocks — simple counters that give every event a number so we can put events in some order. Lamport clocks are useful, but they have a famous blind spot. This section fixes that blind spot with a tool called the vector clock. By the end you will be able to look at two events in a distributed system and say, with certainty, one of exactly three things: "this one definitely came first", "that one definitely came first", or "these two happened at the same time and nobody can break the tie". That third answer — concurrency — is the whole point, and it is the foundation of how real databases like Amazon Dynamo and Riak detect conflicting writes.

Let me define a few words up front, in plain English, before we use them:

  • Node (also called a process or replica): one independent computer in the system. It runs on its own and talks to others only by sending messages.
  • Event: one thing that happens on a node — doing some local work, sending a message, or receiving a message.
  • Causality: a fancy word for cause-and-effect. Event a "causes" event b if information from a could have travelled to b and influenced it. If no information could have travelled between them, neither caused the other.
  • Happened-before (written a → b): the formal name for "a could have influenced b". This is Lamport's relation from the last section. Two events with no happened-before link in either direction are called concurrent — they are independent.

4.1 Why Lamport clocks are not enough

A Lamport clock gives every event a single number, written C(e). Its one guarantee is this:

If a → b (a happened-before b), then C(a) < C(b).

Read carefully: this guarantee only runs in one direction. It says causality forces the numbers to go up. It does not say the reverse. So if you are handed two numbers and you see C(a) < C(b), you cannot conclude that a caused b. The smaller number might just be a coincidence — two totally independent events on two different nodes can easily end up with C(a) = 3 and C(b) = 7 even though neither knew the other existed.

In logic terms: a → b implies C(a) < C(b), but C(a) < C(b) does not imply a → b. The implication is one-way. We say the Lamport clock condition is necessary but not sufficient for causality.

Analogy: Lamport numbers are like the page numbers in a giant printed book where many authors wrote on separate typewriters, then someone shuffled and stapled all the pages into one pile. If page 12 quotes page 5, you know page 5 was written first — quoting proves a real link. But if page 5 and page 30 simply sit in that order in the pile, that tells you nothing: page 30 might have been typed years earlier by an author who never read page 5. The page number orders the pile, but it does not prove who influenced whom.

Why does this matter? Because the most valuable question in a distributed database is exactly the one Lamport clocks cannot answer: "Did these two writes happen one-after-another (so I can safely keep the later one), or did they happen at the same time on different replicas that hadn't yet talked (so I have a genuine conflict)?" To answer that, we need a clock that captures causality in both directions. That clock is the vector clock.

Key takeaway: A Lamport clock can order events but cannot detect concurrency. C(a) < C(b) never proves causation. Vector clocks were invented to close exactly this gap.

4.2 What a vector clock is

A vector clock is, instead of a single number per event, a whole list of numbers — one slot for every node in the system. If there are three nodes — call them A, B, and C — then every vector clock has three slots, like [A:2, B:5, C:1] or just [2, 5, 1] if we agree on a fixed order.

Each node owns exactly one slot — its own. The number in node A's own slot counts "how many events have happened on A so far". The numbers in the other slots are A's best knowledge of "how many events have happened on B and on C, as far as A has heard". So a vector clock is really a little summary: "Here is everything I have personally done, plus everything I have heard about that everyone else has done."

Vector clocks were invented independently in 1988 by two researchers: Colin Fidge ("Timestamps in Message-Passing Systems That Preserve the Partial Ordering", 11th Australian Computer Science Conference) and Friedemann Mattern ("Virtual Time and Global States of Distributed Systems", a 1988 workshop in Château de Bonas, France). Their key result is called the strong clock condition: with vector clocks, a → b is true if and only if VC(a) < VC(b). Notice "if and only if" — the implication now runs both ways, which is precisely what Lamport clocks could not do.

4.3 The three update rules

A vector clock changes by exactly three simple rules. Say node i holds a vector V, and V[i] means "node i's own slot".

  1. On a local event (the node does some internal work): increment your own slot by one. V[i] = V[i] + 1. Nothing else changes.
  2. On sending a message: first increment your own slot (V[i] = V[i] + 1), then attach a copy of your entire vector to the outgoing message. The receiver will need it.
  3. On receiving a message that carries a vector M: first increment your own slot (V[i] = V[i] + 1), then for every slot k, set V[k] = max(V[k], M[k]) — take the bigger of your number and the sender's number, slot by slot. This is called the element-wise maximum.

The element-wise max in rule 3 is the magic. It means "merge in everything the sender knew that I didn't". If the sender had heard about 5 events on node C and I had only heard about 2, after the merge I know about 5 — I've absorbed their knowledge.

Analogy: Think of three friends keeping a shared gossip ledger. Each friend's ledger has a row per friend: "how many stories I've personally lived" plus "how many stories I've heard each other friend has lived". When you do something noteworthy, you bump your own row. When you phone a friend, you read them your whole ledger; they bump their own row, then update each row to the higher of the two counts — so nobody ever loses a story they'd already heard about. The ledger isn't a wall-clock time; it's a record of who-knows-about-what.

4.4 A full worked example with three nodes

Let's run three nodes — A, B, C — all starting at [0, 0, 0]. We'll track the vector after each event. Slots are in the order [A, B, C].

Time ──────────────────────────────────────────────────────────────▶

A: [0,0,0]   a1          a2 ─────────msg m1──────────┐
             [1,0,0]     [2,0,0]                     │
                                                     ▼
B: [0,0,0]                          b1               r1(recv m1)   b2
                                    [0,1,0]          [2,2,0]       [2,3,0]
                                                                    │
                                                          msg m2 ───┘
                                                                    │
C: [0,0,0]        c1                                                 ▼
                  [0,0,1]                                  r2(recv m2)
                                                           [2,3,2]

Legend:  a1,a2,b1,b2,c1 = local events;  m1,m2 = messages;  r1,r2 = receives

Step by step:

  1. a1 on A: local event → A's slot goes from 0 to 1. A = [1,0,0].
  2. c1 on C: local event → C = [0,0,1]. (A and C have never communicated, so they know nothing about each other.)
  3. b1 on B: local event → B = [0,1,0].
  4. a2 on A, then A sends message m1: a2 is the send event, so A increments its own slot to 2: A = [2,0,0]. The message m1 carries the stamp [2,0,0].
  5. r1 on B receives m1: B first increments its own slot (1 → 2), giving [0,2,0], then takes the element-wise max with the message's [2,0,0]: max(0,2)=2 for A, max(2,0)=2 for B, max(0,0)=0 for C → B = [2,2,0]. B now knows A has done 2 events.
  6. b2 on B, then B sends m2: increment B's own slot → B = [2,3,0]. Message m2 carries [2,3,0].
  7. r2 on C receives m2: C increments its own slot (1 → 2) → [0,0,2], then element-wise max with [2,3,0]: → C = [2,3,2]. C now knows about A's 2 events and B's 3 events, all from one message chain.

Notice the causality captured in the final vector on C, [2,3,2]: it "contains" A's work and B's work because information flowed A → B → C. That is exactly the happened-before chain, recorded inside the numbers.

4.5 Comparing two vector clocks — the three possible answers

This is the payoff. Given two vector clocks U and V, compare them slot by slot. There are exactly three outcomes:

ResultDefinition (compare every slot)Meaning
Equal (U = V)Every slot is equal.The same point in time / the same event.
Happened-before (U < V)Every slot of U is the matching slot of V, and at least one slot is strictly <.U causally came first; U → V. V is a descendant of U.
Concurrent (U ∥ V)Neither U < V nor V < U holds — each has at least one slot bigger than the other.Independent. Neither caused the other. Potential conflict.

The crucial rule to memorise: "less-than-or-equal in every slot, and strictly less in at least one" means happened-before. If you cannot say that about either direction, the events are concurrent.

Example — three comparisons:
  • [1,0,0] vs [2,2,0]: every slot of the first is ≤ the second (1≤2, 0≤2, 0≤0) and at least one is strictly smaller → first happened-before second. The first event causally precedes the second.
  • [2,0,0] vs [0,1,0]: first has A=2 > 0, but second has B=1 > 0. Each beats the other in some slot → concurrent. (These are events a2 and b1 from our diagram — and indeed they happened before A and B ever talked.)
  • [2,3,0] vs [2,3,2]: 2≤2, 3≤3, 0≤2, one strictly smaller → first happened-before second.
Common mistake: People think "I'll just compare the sums" — add up the slots and treat the bigger total as later. This is wrong. [2,0,0] (sum 2) and [0,1,0] (sum 1) have different sums but are actually concurrent. Summing throws away exactly the per-node detail that detects concurrency. You must compare slot-by-slot.

4.6 Detecting concurrent updates — the basis of "siblings"

Now connect this to a real database. Imagine a key-value store that keeps copies of your data on several nodes for availability (so the store stays up even if one node dies). Each stored version of a value carries a vector clock. When a node receives a write, it compares the incoming version's clock to the version it already has:

  • If the new clock is greater than (happened-after) the old one → the new write is a straightforward, causal update. Overwrite the old version. No conflict.
  • If the new clock is less than the old one → the new write is stale (an old update arriving late). Keep the version you have.
  • If the two clocks are concurrent → two clients updated the same key independently, on replicas that hadn't synced. Neither is "newer". The database keeps both versions. These coexisting concurrent versions are called siblings.

The database cannot decide which sibling is "right" — that's a business question (which shopping cart? which user profile?). So it hands all siblings back to the application at the next read and asks it to merge them. This is called semantic reconciliation: the app applies domain rules (for example, "union the two shopping carts so no item is lost") and writes back one reconciled value whose new vector clock descends from both siblings, collapsing the branch.

4.7 Real-world: Amazon Dynamo and the shopping-cart example

This design is exactly what Amazon described in their landmark 2007 Dynamo paper ("Dynamo: Amazon's Highly Available Key-value Store"). Dynamo chose availability over strong consistency — the shopping cart must always accept an "add to cart", even during network trouble. That means two replicas can take writes independently and disagree, so Dynamo uses vector clocks (it calls them version vectors, stored as a list of (node, counter) pairs) to tell apart genuine causal updates from real conflicts.

Here is the actual example from the Dynamo paper, with server nodes Sx, Sy, Sz handling writes to one shopping-cart object:

            WRITE handled by Sx                D1  [(Sx,1)]
                     │                          │
            WRITE handled by Sx                D2  [(Sx,2)]   (D2 descends from D1)
                     │
        ┌────────────┴────────────┐
        │ (network split:         │
        │  two replicas write     │
        │  without seeing         │
        │  each other)            │
        ▼                          ▼
  WRITE on Sy                WRITE on Sz
  D3 [(Sx,2),(Sy,1)]         D4 [(Sx,2),(Sz,1)]
        │                          │
        └────────────┬─────────────┘
                     ▼
          client READ sees BOTH D3 and D4
          D3 vs D4:  D3 has (Sy,1) but no Sz
                     D4 has (Sz,1) but no Sy
                     ⇒ NEITHER ≤ the other ⇒ CONCURRENT
                     ⇒ they are SIBLINGS, returned together
                     │
                     ▼
       client reconciles (merges carts) and writes back via Sx
          D5 [(Sx,3),(Sy,1),(Sz,1)]
          D5 descends from BOTH D3 and D4 ⇒ conflict resolved

Walk through the clocks: D3 = [(Sx,2),(Sy,1)] and D4 = [(Sx,2),(Sz,1)]. Compare slot by slot. For the Sy slot, D3 has 1 and D4 has 0 (absent counts as 0) — D3 wins. For the Sz slot, D4 has 1 and D3 has 0 — D4 wins. Each beats the other somewhere, so neither happened-before the other: they are concurrent, i.e. a real conflict. Dynamo returns both as siblings. The client merges and writes D5 = [(Sx,3),(Sy,1),(Sz,1)], which is both D3 and D4 in every slot — so D5 unambiguously supersedes both, and the branch is healed.

Riak, the open-source database directly inspired by Dynamo, uses the same idea and exposes siblings to your application when an allow_mult setting is on. Riak later improved on plain vector clocks with dotted version vectors (DVV) — a refinement that avoids a subtle problem where naive client-id vectors grow without bound and can produce "false siblings". With DVV the clock size stabilises around the replication factor (typically 3) instead of growing with the number of clients.

Key takeaway: "Siblings" are just two versions whose vector clocks compare as concurrent. Vector clocks are the mechanism that lets an always-available database know when it has a genuine conflict to hand back, versus a safe overwrite.

4.8 Causality cousins: Git's DAG and CRDTs

You already use a causality-tracking system every day: Git. Git doesn't use vector clocks, but it captures the same happened-before idea with a DAG (Directed Acyclic Graph — a graph of nodes with one-way arrows and no loops). Each commit points back to its parent commit(s). "Commit X is an ancestor of commit Y" is Git's version of "X happened-before Y". When two people branch from the same commit and both commit independently, neither is an ancestor of the other — those commits are concurrent, which is exactly why Git asks you to merge (and sometimes flags a conflict). A merge commit with two parents is Git's "reconciliation": a new node that descends from both branches, just like Dynamo's D5.

Analogy: A vector clock and a Git history are two ways of writing the same story. The vector clock is the summary ("I've seen 2 of Sx's edits and 1 of Sy's"); the Git DAG is the full family tree drawn out with arrows. Both let you ask the same question — "is this version an ancestor, a descendant, or a cousin?" A cousin (concurrent / two siblings / a fork) is the one that needs a human or a merge rule.

A related modern idea is the CRDT (Conflict-free Replicated Data Type) — a special data structure designed so that concurrent updates always merge automatically, with no human reconciliation needed. The intuition: instead of just detecting a conflict (what vector clocks do) and dumping it on the app, a CRDT defines a merge rule baked into the data type itself that is guaranteed to give the same answer no matter what order updates arrive in. For example, a "grow-only set" merges by union — order doesn't matter, so concurrent adds never conflict. CRDTs often still use vector-clock-style version information internally to know what each replica has seen; they just add a mathematically safe automatic merge on top.

4.9 Trade-offs: the clock grows with the cluster

Vector clocks are not free. The vector has one slot per node that has ever written the object. In a small fixed cluster that's fine. But:

  • In a large system with many nodes, the vector gets long, and it travels with every message and every stored version — more bytes on the wire and on disk.
  • Worse, if you give each client its own slot (rather than each server), the vector can grow without limit as new clients appear. The Dynamo paper specifically warns that during failures or partitions, writes handled by nodes outside the normal set make the vector clock grow.

The fix is pruning: when the vector exceeds a threshold of entries, throw away the oldest (node, counter) pairs. Dynamo attaches a timestamp to each pair recording when that node last touched the object, and drops the least-recently-updated entries once the count crosses a limit (Dynamo used a threshold of 10; Riak's tunable limit sits in roughly the 20–50 range). Pruning trades a tiny, rare risk of an inaccurate causality judgement (you might fail to detect that two versions are actually related, producing an unnecessary sibling) for a bounded clock size. In practice this is a sound trade because the discarded entries are old and almost never relevant.

Common mistake: Using one vector-clock slot per client/actor instead of per server replica. With millions of clients the vector explodes and you get "false conflicts". This is exactly the trap dotted version vectors (DVV) were designed to escape; prefer per-replica vectors or DVV for client-facing writes.

Common mistakes & misconceptions

Common mistake: Believing C(a) < C(b) from a Lamport clock proves a caused b. It does not — the implication only runs the other way. Only a vector clock gives the "if and only if".
Common mistake: Comparing vector clocks by summing the slots or comparing them pairwise as single numbers. You must compare slot-by-slot; a difference where each clock leads in some slot means concurrent, not "the bigger sum wins".
Common mistake: Forgetting to increment your own slot on a receive. Some learners only do the element-wise max. The rule is increment-own-slot then max — the receive is itself a new local event and must be counted.
Common mistake: Treating concurrent versions ("siblings") as an error to silently auto-overwrite with "last write wins" by wall-clock time. That throws away a real, intentional update (e.g. an item silently vanishes from a shopping cart). Concurrency means reconcile, not discard.
Common mistake: Assuming vector clocks order all events into one line. They don't — they produce a partial order. Concurrent events are deliberately left unordered, because forcing an order would be a lie about causality.
Common mistake: Putting a slot per client and never pruning, then being surprised when clock metadata dwarfs the actual data and false siblings pile up. Bound the vector (pruning / DVV) and slot per replica.

Best practices

Best practice: Allocate one vector-clock slot per stable server replica, not per transient client, so the vector size is bounded by your cluster size, not your traffic.
Best practice: Always cap the vector with a pruning threshold and store a last-updated timestamp per entry so you can drop the oldest entries safely under failures and partitions.
Best practice: Make conflict resolution explicit and domain-aware. Decide the merge rule for each data type up front (union carts, max counters, latest-by-field) instead of relying on a blind last-write-wins.
Best practice: Where possible, model data as a CRDT so concurrent updates merge automatically and you never expose siblings to application code — reserve manual reconciliation for data that genuinely needs human judgement.
Best practice: When debugging "which write should have won?", log the full vector clock of each version. The slot-by-slot comparison tells you immediately whether you had a causal update or a true concurrent conflict.
Best practice: For client-facing writes that can race, prefer dotted version vectors (DVV) over naive client-id vector clocks to avoid unbounded growth and false siblings.

Section summary

  • Lamport clocks order events but cannot detect concurrency: C(a) < C(b) does not prove a → b.
  • A vector clock is a list of counters, one slot per node; your slot counts your events, other slots are your knowledge of theirs.
  • Three update rules: local event → increment own slot; send → increment own, attach vector; receive → increment own, then element-wise max with the message's vector.
  • Compare slot-by-slot: equal, happened-before (≤ in all slots and < in at least one), or concurrent (each leads somewhere).
  • Vector clocks satisfy the strong clock condition — a → b if and only if VC(a) < VC(b) — invented by Fidge and Mattern in 1988.
  • Concurrent versions are "siblings"; databases keep both and ask the application to reconcile them (semantic reconciliation).
  • Amazon Dynamo and Riak use vector/version vectors to detect conflicts; the classic D1→D5 shopping-cart example shows two concurrent writes merging into one descendant.
  • Git's commit DAG is a causality cousin: ancestor = happened-before, fork = concurrent, merge commit = reconciliation; CRDTs add automatic, conflict-free merging.
  • Trade-off: vector size grows with the number of writing nodes; bound it with pruning (drop oldest entries) or dotted version vectors.

Sources: Vector clock — Wikipedia, Dynamo: Amazon’s Highly Available Key-value Store (Riak docs mirror), Dotted Version Vectors (GSD, U. Minho), Vector Clocks Revisited Part 2: Dotted Version Vectors — Riak, Vector Clocks — Kevin Sookocheff, Logical Clocks — pk.org

Continue reading