Distributed Systems

The hard middle of modern engineering — replication, partitioning, and failure.

20 posts · Engineering

  1. How to Use This Guide

    Consensus is the problem of getting a group of separate computers to agree on one value — one decision, one order of events — even though some of them may…

  2. The Consensus Problem

    This whole guide is about one deceptively small word: agree. You already know that in a distributed system there are many machines (we call them nodes), the…

  3. Replicated State Machines & the Log

    This section gives you the one mental model that every practical consensus system — Raft, Paxos, ZooKeeper's ZAB, etcd, Consul, Google's Chubby and Spanner —…

  4. Raft — Leader Election

    Raft is an algorithm for keeping a replicated log in agreement across a cluster of servers. A "replicated log" is just an ordered list of commands (like "set x…

  5. Raft — Log Replication, Safety & Membership

    In the previous section you saw how Raft picks a single leader (the one server allowed to accept new commands) and how a term (a numbered period of time, each…

  6. Paxos — The Original Consensus Algorithm

    This is the algorithm that started it all. Paxos is a protocol that lets a group of unreliable machines agree on a single value, even when some of them crash,…

  7. Multi-Paxos, Raft vs Paxos & the Real World

    So far you have seen consensus as a way to agree on one value. In the previous sections, basic Paxos (single-decree Paxos) agrees on a single decision: one…

  8. Glossary of Terms

    Acceptor: A Paxos role. Acceptors are the voting members; a value is chosen only when a majority of acceptors accept it.

  9. Frequently Asked Questions

    Q: Raft or Paxos, which should I actually use? A: For almost any new system, reach for a battle-tested Raft library (etcd's, or HashiCorp's). Raft was…

  10. Revision Cheat Sheet

    One-line everything: Consensus = agree on one ordered log of commands, using a majority quorum, so identical replicated state machines stay in lockstep despite…

  11. How to Use This Guide

    Welcome. A distributed system is just a group of computers that work together to look like one single service to the people using it.

  12. What Is a Distributed System?

    Welcome to the very first step. You are a web developer, so you already know how to build software that runs on a computer.

  13. Why Distributed Systems Are Hard

    In the first section you learned what a distributed system is: a group of separate computers that work together over a network but appear, to the user, as one…

  14. Time, Clocks & the Ordering of Events

    One deceptively simple question: in what order did things happen? On a single computer this is trivial — you look at the clock, or you read the lines of a log…

  15. Vector Clocks & Causality

    In the previous section you met Lamport clocks — simple counters that give every event a number so we can put events in some order.

  16. The CAP Theorem (and PACELC)

    Once you split data across more than one machine, you run into a hard, unavoidable rule about what those machines can promise you.

  17. Consistency Models

    Imagine you save data in one place and it instantly appears everywhere. That is the dream. Reality is messier.

  18. Glossary of Terms

    Asynchronous network: A network where messages can take any amount of time to arrive, with no guaranteed upper limit. The internet behaves this way.

  19. Frequently Asked Questions

    Q: What actually makes a system "distributed" — isn't every website on a server somewhere? A: A single server running everything is not distributed.

  20. Revision Cheat Sheet

    The 8 Fallacies of Distributed Computing (all are FALSE): the network is reliable; latency is zero; bandwidth is infinite; the network is secure; topology doesn't…