Frequently Asked Questions

By Pritesh Yadav

Q: What actually makes a system "distributed" — isn't every website on a server somewhere?

A: A single server running everything is not distributed. It becomes distributed the moment two or more separate machines (nodes) cooperate to provide one service. The defining feature is that those machines can only learn about each other by sending messages over a network — and that network can be slow, drop messages, or split in two. That single fact is where all the difficulty comes from.

Q: Why can't I just put a timestamp on everything and sort by time?

A: Because every machine's clock is slightly wrong, and they are all wrong by different amounts (clock skew). Two events a millisecond apart on two machines can easily get timestamps that are in the wrong order. So sorting by wall-clock time can tell you event B came "before" event A even when A actually caused B. Logical clocks exist precisely to order events correctly without trusting the wall clock.
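
A minimal sketch of that failure, assuming two hypothetical nodes whose clocks are off by fixed amounts (the node names and skew values are invented for illustration, and `true_time_ms` is something no real node can observe):

```python
from dataclasses import dataclass

@dataclass
class Event:
    name: str
    node: str
    true_time_ms: float   # real time; unobservable in a real system
    timestamp_ms: float   # what the node's own wall clock reported

# Hypothetical skew: node_a runs 5 ms fast, node_b runs 3 ms slow.
SKEW_MS = {"node_a": 5.0, "node_b": -3.0}

def record(name, node, true_time_ms):
    """Stamp an event with the (skewed) local clock of its node."""
    return Event(name, node, true_time_ms, true_time_ms + SKEW_MS[node])

# A happens on node_a; 1 ms later B happens on node_b as a result of A.
a = record("A", "node_a", true_time_ms=1000.0)
b = record("B", "node_b", true_time_ms=1001.0)

by_wall_clock = [e.name for e in sorted([a, b], key=lambda e: e.timestamp_ms)]
by_true_time = [e.name for e in sorted([a, b], key=lambda e: e.true_time_ms)]

print(by_wall_clock)  # ['B', 'A']  (effect sorted before its cause)
print(by_true_time)   # ['A', 'B']
```

With only 8 ms of combined skew, the sort by timestamp puts the effect ahead of its cause.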

Q: If NTP keeps clocks in sync, why is clock skew still a problem?

A: NTP narrows the gap — it does not close it. After syncing, machines are still typically a few milliseconds apart, and network delay means the correction itself arrives late. A few milliseconds is an eternity in computing; thousands of events can happen in that window. NTP makes timestamps "good enough for humans," not "good enough to order events safely."

Q: Lamport clocks vs vector clocks — which should I use?

A: Use a Lamport clock when you only need one agreed order of events that respects causality (ties are usually broken by node ID), and you never need to ask afterward whether two particular events were actually related. Use a vector clock when you need to know whether two events were causally related or happened independently (concurrent), for example to detect and resolve write conflicts. The trade-off: a Lamport clock is one number; a vector clock is one number per node, so it costs more space and grows with the cluster.
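
For a sense of how small a Lamport clock is, here is a minimal sketch of the usual rules: increment on every local event or send, and on receive jump past the sender's number. The class and method names are illustrative, not taken from any particular library.

```python
class LamportClock:
    """One integer counter per node."""

    def __init__(self):
        self.time = 0

    def tick(self):
        """Local event or message send: advance the counter."""
        self.time += 1
        return self.time

    def receive(self, msg_time):
        """Message receive: move past both our own and the sender's counter."""
        self.time = max(self.time, msg_time) + 1
        return self.time

p, q, r = LamportClock(), LamportClock(), LamportClock()

send_ts = p.tick()             # p sends a message, stamped 1
q.tick()                       # q does unrelated local work: 1
q.tick()                       # more local work: 2
recv_ts = q.receive(send_ts)   # max(2, 1) + 1 = 3
other_ts = r.tick()            # r never talks to anyone: 1

print(send_ts, recv_ts, other_ts)  # 1 3 1
```

The send (1) is ordered before the receive (3), as causality requires. But `other_ts` is also smaller than `recv_ts` even though `r` had no contact with `q`, which is why a smaller Lamport number alone says nothing about cause.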

Q: What's the real difference between Lamport and vector clocks in one sentence?

A: A Lamport clock guarantees "if A caused B then A's number is smaller," but a smaller number does not prove causation; a vector clock can tell you with certainty "A happened before B (and so could have caused it)," "B happened before A," or "they're concurrent."
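
That three-way answer comes from comparing two vector clocks entry by entry. The sketch below assumes a clock is a plain dict from node name to counter; `merge` and `compare` are hypothetical helper names chosen for this example.

```python
def merge(local, received, node):
    """Receive rule: element-wise max of both clocks, then tick our own entry."""
    merged = {n: max(local.get(n, 0), received.get(n, 0))
              for n in local.keys() | received.keys()}
    merged[node] = merged.get(node, 0) + 1
    return merged

def compare(a, b):
    """Classify the relationship between two vector clocks."""
    nodes = a.keys() | b.keys()
    a_le_b = all(a.get(n, 0) <= b.get(n, 0) for n in nodes)
    b_le_a = all(b.get(n, 0) <= a.get(n, 0) for n in nodes)
    if a_le_b and b_le_a:
        return "equal"
    if a_le_b:
        return "a before b"
    if b_le_a:
        return "b before a"
    return "concurrent"

write_x = {"n1": 1}                         # n1 writes
write_y = merge({"n2": 0}, write_x, "n2")   # n2 saw write_x, then wrote
write_z = {"n3": 1}                         # n3 writes without seeing anything

print(write_y)                     # {'n1': 1, 'n2': 1} (key order may vary)
print(compare(write_x, write_y))   # a before b
print(compare(write_y, write_x))   # b before a
print(compare(write_x, write_z))   # concurrent
```

A "concurrent" result is the signal a conflict-resolution step would act on; the two "before" results mean one write simply supersedes the other.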

Q: Is the CAP theorem still relevant, or is it outdated?

A: It is still relevant, but it is often quoted too simply. CAP only describes what happens during a network partition — and even then it is not a clean "pick 2 of 3," because partition tolerance is mandatory on real networks. So the real choice is just C vs A while the network is broken. PACELC is the more honest, modern framing because it also describes the trade-off (latency vs consistency) when nothing is broken — which is most of the time.

Q: Does "choosing availability" mean my data gets corrupted?

A: No. It means that during a network split, different parts of the system may temporarily show different (stale) values, and they may have to reconcile conflicting writes afterward. The data is not corrupt — it is just temporarily out of agreement. Choosing consistency instead means some requests get refused during the split rather than risk showing stale data.
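
The two choices can be sketched as a single read path that behaves differently while the node is cut off. Everything here (`Replica`, `Unavailable`, the `mode` flag) is a hypothetical toy model, not how any specific database exposes the setting.

```python
class Unavailable(Exception):
    """Raised when a node refuses to answer rather than risk a stale read."""

class Replica:
    def __init__(self, mode, value):
        self.mode = mode           # "CP" (prefer consistency) or "AP" (prefer availability)
        self.value = value         # last value this replica heard about
        self.partitioned = False   # True while cut off from the other replicas

    def read(self):
        if self.partitioned and self.mode == "CP":
            # Cannot confirm this is the latest value, so refuse.
            raise Unavailable("partitioned: cannot confirm latest value")
        # In AP mode this may be stale during a partition, but it is still
        # a value that was really written at some point.
        return self.value

ap = Replica("AP", "v1")
cp = Replica("CP", "v1")
ap.partitioned = cp.partitioned = True   # the network splits

ap_result = ap.read()          # answers with its last known value
try:
    cp.read()
    cp_refused = False
except Unavailable:
    cp_refused = True

print(ap_result, cp_refused)   # v1 True
```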

Q: Does "eventually consistent" mean my data is wrong?

A: Not wrong, just possibly not the latest yet. Eventual consistency promises that if writes stop, every copy will eventually converge to the same value. In the meantime a read might be stale (an old-but-valid value). For things like a "likes" count this is fine; for a bank balance you would want a stronger model. The data is never garbage; it is just behind.

Q: How stale can an eventually consistent read actually be?

A: The basic definition gives no time bound at all — "eventually" could in theory be a long time, though in practice good systems converge in milliseconds to seconds. If you need a promise like "no more than 5 seconds behind," you need a stronger model (such as bounded staleness) — plain eventual consistency does not provide one.

Q: What's the difference between a fault and a failure? They sound the same.

A: A fault is one thing going wrong locally — a disk dies, a packet is lost, a node freezes. A failure is when the whole system stops doing its job. The entire craft of distributed systems is stopping faults from turning into failures (this is "fault tolerance"). Faults are normal and constant; failures are what you design to prevent.

Q: Why do retries need to be "idempotent"? Can't I just resend a request?

A: Because on an unreliable network you often cannot tell whether your request failed or whether just the reply got lost. If you resend a non-idempotent request like "charge $10," you might charge twice. Idempotency means doing the operation again has no extra effect — so retrying is always safe. This is why so many systems use idempotency keys.
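
A minimal sketch of an idempotency key in action, assuming a hypothetical in-memory `PaymentService` that remembers the result it produced for each key (a real service would persist this and expire old keys):

```python
import uuid

class PaymentService:
    """Toy server: stores the first result for each idempotency key."""

    def __init__(self):
        self.total_charged = 0
        self._results = {}   # idempotency key -> result already returned

    def charge(self, idempotency_key, amount):
        # A repeated key means a retry: replay the stored result
        # instead of charging again.
        if idempotency_key in self._results:
            return self._results[idempotency_key]
        self.total_charged += amount
        result = {"status": "charged", "amount": amount}
        self._results[idempotency_key] = result
        return result

service = PaymentService()
key = str(uuid.uuid4())           # client generates one key per logical request

first = service.charge(key, 10)   # suppose the reply to this call is lost
retry = service.charge(key, 10)   # client resends with the SAME key

print(service.total_charged)      # 10
print(first == retry)             # True
```

The client must reuse the same key for every retry of one logical request; generating a fresh key per attempt would defeat the check and charge twice.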

Q: Is linearizability the same as "the database is fast" or "ACID"?

A: No. Linearizability is about freshness and ordering of a single object — the system behaves as if there is one copy and every operation happens at one instant. It says nothing about speed (it usually costs latency) and it is not the same as ACID transactions, which are about multi-step operations being all-or-nothing and isolated. They are different guarantees that often appear together but are not interchangeable.

Q: Strong consistency sounds best — why doesn't everyone just use it?

A: Because it costs latency and availability. To guarantee every read is current, nodes must coordinate before answering, which is slower, and during a network partition they may have to refuse requests entirely (the CAP choice). For data where stale-by-a-moment is harmless, weaker consistency gives a faster, more available system. You pick the weakest model that is still correct for your use case.

Q: What is a quorum and why "more than half"?

A: A quorum is the minimum number of nodes that must agree for a decision to count. "More than half" is popular because any two majorities must share at least one node, so two disjoint groups can never both reach a majority. As a result the system cannot make two conflicting decisions at once, even if the network splits. This is how systems stay safe without needing every single node to respond.
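
The overlap argument can be checked by brute force for small clusters. This sketch enumerates every possible quorum of a given size and tests whether any two are disjoint; the function names are made up for this example.

```python
from itertools import combinations

def majority(n):
    """Smallest group size that is more than half of n nodes."""
    return n // 2 + 1

def quorums_always_overlap(n, size):
    """True if every pair of size-`size` groups out of n nodes shares a node."""
    groups = [set(c) for c in combinations(range(n), size)]
    return all(a & b for a, b in combinations(groups, 2))

print(majority(5))                   # 3
print(quorums_always_overlap(5, 3))  # True: 3 + 3 > 5, so two groups must share a node
print(quorums_always_overlap(4, 2))  # False: {0, 1} and {2, 3} are disjoint
print(quorums_always_overlap(4, 3))  # True: majority of 4 is 3
```

Exactly half is not enough: with 4 nodes and groups of 2, each side of a clean split could "decide" on its own, which is the situation a majority rule rules out.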

Q: I'm a beginner — which of these six topics matters most in practice?

A: For day-to-day reasoning, the most useful ideas are the fallacies (Section 2, they keep you humble about the network), consistency models (Section 6, they tell you what your database actually promises), and CAP/PACELC (Section 5, the trade-offs you will discuss in design reviews). Clocks and causality (Sections 3–4) are the deeper foundation that makes the others click, so don't skip them — but the three above are what you will reach for first.

Q: Do I need to memorise all eight fallacies?

A: It helps, but the spirit matters more than the list. The one to truly internalise is that "the network is reliable" is false. Once you stop assuming messages always arrive, on time and in order, the rest follow naturally. Treat the network as slow, lossy, and occasionally split, and you will design defensively without reciting the list.
