What Is a Distributed System?

By Pritesh Yadav 19 min read

Welcome to the very first step. You are a web developer, so you already know how to build software that runs on a computer. This section answers one question in plain English: what happens when one computer is not enough, and we make many computers work together as a team? That team is called a distributed system. By the end of this section you will understand what one is, why people build them, and the basic words used to talk about them. We will go slowly and define every term before we use it.

1.1 A plain-English definition

A distributed system is a group of separate computers that work together over a network and try to look, to the user, like a single computer.

Let's break that sentence into its three important parts:

  • Separate computers — these are physically different machines. They each have their own processor, their own memory, and their own copy of the program. One could be in New York and another in Tokyo. They do not share memory the way two programs on your laptop do.
  • Work together over a network — a network is just the wiring (cables, Wi-Fi, the internet) that lets computers send messages to each other. The only way these separate computers can cooperate is by sending each other messages, like passing notes.
  • Look like a single computer to the user — this is the magic goal. When you open Gmail, you do not think "I am now talking to machine number 4,812 in a data center in Oregon." You just think "I am using Gmail." The fact that thousands of machines are involved is hidden from you.

The textbook version of this definition, from a well-known distributed-systems book by Andrew Tanenbaum and Maarten van Steen, says it well: "a collection of autonomous computing elements that appears to its users as a single coherent system." "Autonomous" means each computer can run on its own; "coherent" means it all behaves like one sensible whole.

There is also a famous, funny definition by computer scientist Leslie Lamport (a pioneer of the field): "A distributed system is one in which the failure of a computer you didn't even know existed can render your own computer unusable." This joke captures a deep truth we will return to again and again: in a distributed system, your work can be affected by far-away machines you cannot see and did not know about. Things break in surprising ways.

Analogy: Think of a large restaurant kitchen. To the diner in the dining room, there is "the kitchen" — one thing that produces their meal. But behind the door there are many separate cooks: one on the grill, one on salads, one plating desserts, one washing dishes. They talk to each other by shouting orders ("two steaks, medium!"). They are separate people doing separate jobs, but the diner experiences a single, smooth kitchen. A distributed system is that kitchen: many workers (computers) coordinating by passing messages, presenting one face to the customer.

1.2 Single computer vs. distributed system: what actually changes?

You have spent your career so far mostly thinking about a single computer — one machine running your code. Everything in that world is comfortable and predictable. When you move to many machines, a few comfortable assumptions disappear. This shift is the heart of why distributed systems are hard.

TopicSingle computerDistributed system (many computers)
How parts share dataShared memory — instant and reliable. One function calls another in nanoseconds.Messages over a network — slow and can get lost, delayed, or duplicated.
Clock / timeOne clock. "Now" means one thing.Every machine has its own clock, and they drift apart. There is no single "now."
FailureIf it crashes, the whole thing stops. Simple to reason about.One machine can fail while others keep running. The system is partly alive, partly dead — much harder to reason about.
Knowing the stateYou can look at memory and see the exact current state.No machine knows the full, up-to-date state of all the others. Information is always a little stale.
Growth limitLimited by how big and fast one machine can be.Add more machines to grow almost without limit.

The single most important change is this: communication is now by messages over a network, and the network is unreliable. Messages can be slow, arrive out of order, get lost entirely, or arrive twice. On one machine, calling a function always works and always returns instantly. Across machines, a request might never get an answer — and you cannot tell whether the other machine did the work, is just slow, or has died. This single fact is the source of most distributed-systems difficulty, and the rest of this study guide is largely about coping with it.

Key takeaway: Going from one computer to many is not just "more of the same." You lose three comfortable things: shared memory, a single clock, and all-or-nothing failure. In exchange, you gain the ability to grow, to survive failures, and to be fast for users everywhere. The whole field is about managing that trade.

1.3 Everyday real examples you already use

Distributed systems are not exotic. You use dozens every day, often without realizing it. Here are concrete ones:

  • Any large web app (Google Search, YouTube, Netflix, Amazon) — when you search Google, your one query is handled by thousands of machines splitting the work, then combining results in a fraction of a second. No single machine could hold all of Google's index or serve all its users.
  • WhatsApp / messaging apps — your message travels from your phone, through many servers in data centers, to your friend's phone, which may be on the other side of the planet. Servers store the message if your friend is offline and deliver it later. Many machines cooperate to make "send a message" feel instant.
  • Online banking — your balance, transactions, and fraud checks are spread across many machines and data centers, replicated so that if one fails, your money records are not lost. Banks legally cannot afford to lose data, so they keep multiple copies on multiple machines.
  • DNS (the Domain Name System) — this is the internet's phone book that turns a name like google.com into a numeric address. It is one of the oldest and largest distributed systems in the world. There are 13 named "root server" addresses, but behind them sit roughly 1,900+ physical server instances spread across every populated continent, all answering as if they were one. (We'll use a clever trick called anycast here — more on that later in the guide.)
  • Git — the version-control tool you use is a distributed system too. Every developer's laptop has a full copy of the repository's history. There is no single "master" computer that must be online; people sync copies by pushing and pulling. This is a distributed system you literally hold in your hands.
  • Cassandra and Amazon DynamoDB (databases) — these are databases designed from the ground up to run across many machines. Both trace their design back to a famous 2007 Amazon research paper called "Dynamo: Amazon's Highly Available Key-value Store." They store your data on several machines at once so the database keeps working even when individual machines die. We'll meet them repeatedly throughout this guide.
Example: Imagine you tweet a photo. (1) Your phone uploads it to a nearby server. (2) That server stores copies on several storage machines in different buildings, so losing one building loses nothing. (3) A separate fleet of machines resizes the photo into thumbnails. (4) Another fleet updates your followers' timelines. (5) When a follower in another country opens the app, a server near them serves the photo quickly. Dozens of machines across the planet cooperated for one tweet — and to you it felt like pressing one button.

1.4 Why do we build them? The four big reasons

Distributed systems are harder to build than single-machine programs. So why bother? There are four core motivations. Almost every distributed system exists because of at least one of these.

Reason 1: Scalability — handle more load

Load means the amount of work coming in: number of users, requests per second, or amount of data. A single machine has a ceiling — only so much processor, memory, and disk. Scalability is the ability to handle more load by adding more machines. Instead of buying one impossibly huge computer, you add many ordinary ones and spread the work across them. This is how a startup with 100 users can grow to 100 million users without rewriting everything.

Reason 2: Fault tolerance & high availability — survive failures

Machines break. Disks die, power fails, cables get cut, software crashes. On a single machine, one failure means total downtime. A distributed system keeps several machines doing the same job, so if one dies, others carry on and the user never notices. Fault tolerance means the system keeps working correctly even when parts fail. High availability means the system is up and answering almost all the time (often measured as "how many minutes per year is it down?"). The trick is having no single point that, if it breaks, takes everything down with it.

Reason 3: Low latency — serve users close to them

Latency is the delay between asking for something and getting the answer — how long you wait. Physics sets a hard limit: data cannot travel faster than light, so a request from London to a server in Sydney takes real, noticeable time just to make the round trip. The fix is to put copies of your system in many places around the world, so each user talks to a machine physically near them. A user in Sydney hits a Sydney machine; a user in London hits a London machine. Both feel fast. (This is exactly what DNS's anycast and Netflix's regional servers do.)

Reason 4: Data too big for one machine

Sometimes the data itself simply will not fit on one computer — Google's web index or Facebook's photos are far larger than any single disk. The only option is to split the data across many machines, each holding a slice. This splitting is so central it has its own vocabulary (partitioning/sharding), which we define just below.

Analogy: Picture a popular bakery. Scalability: when lines grow, you add more counters and bakers rather than asking one baker to move faster. Fault tolerance: if one baker calls in sick, the others keep serving — the shop doesn't close. Low latency: instead of one giant bakery in one city, you open branches in every neighborhood so customers don't travel far. Too big for one: you can't store a year of flour in one room, so you keep stock spread across several warehouses. Every reason to distribute a system maps onto a reason to run more than one bakery.

1.5 Core vocabulary — defined simply

You'll see these words constantly. Learn them now and the rest of the guide reads easily. Each is defined in plain words first, then with a tiny example.

TermPlain meaningTiny example
NodeOne computer (or one running process) that is part of the system. The basic building block.One server in a data center is a node.
ClusterA group of nodes working together as one system."Our database cluster has 6 nodes."
ServerA node that provides a service (answers requests).The machine that returns web pages.
ClientThe thing asking the server for something.Your browser or phone app.
ReplicaA copy of the same data kept on another node, for safety and speed. "Replication" = making such copies.Your bank balance stored on 3 machines; each is a replica.
Partition / ShardA slice of the data, so different nodes hold different pieces. "Sharding" = splitting data into slices. (Warning: "partition" has a second meaning — see the mistake box.)Users A–M on node 1, users N–Z on node 2.
LatencyHow long one request takes — the wait time."This page loads in 40 milliseconds."
ThroughputHow much work the system does per second — the volume."It handles 50,000 requests per second."
AvailabilityThe fraction of time the system is up and answering."99.9% available" ≈ down ~8.7 hours/year.
FaultSomething going wrong underneath (a disk error, a dropped message). A fault may or may not cause a visible problem.A network cable briefly disconnects.
FailureWhen the system actually fails to do its job, as seen by the user. A good distributed system tolerates faults so they don't become failures.The user gets an error instead of their page.

Two pairs are worth pausing on because beginners mix them up:

  • Latency vs. throughput. Latency is about one task's delay (how long you wait). Throughput is about many tasks (how many get done per second overall). A highway analogy: latency is how long your car takes to drive the road; throughput is how many cars pass per minute. You can improve one without the other.
  • Fault vs. failure. A fault is a problem in a part. A failure is when the whole system can't deliver. The entire point of distributed systems engineering is to stop faults from becoming failures — to absorb the broken part so the user never feels it.

1.6 What a distributed system is trying to achieve (its goals)

When engineers design a distributed system, they aim for a set of properties. Here are the main ones, in plain words:

  • Transparency — hide the messy distributed details from the user. The user should not have to know how many machines there are, where they are, or that data is copied around. "It just works like one thing." (Forms of transparency include hiding location, hiding replication, and hiding failures when a node dies.)
  • Scalability — the system should grow gracefully as load grows, by adding nodes, without a rewrite and without slowing to a crawl.
  • Reliability — the system does the correct thing and does not lose or corrupt data, even when parts misbehave.
  • Availability — the system is up and answering whenever users need it. (Reliability is "right answers"; availability is "an answer at all." They are related but not identical — a system can be available but wrong, or correct but down.)
  • Fault tolerance — it keeps running through individual failures, which is how it achieves availability in the real world where machines constantly break.

These goals often pull against each other — making something more available can make it harder to keep perfectly consistent, for example. Much of this guide is about understanding those trade-offs (you'll later meet the famous CAP theorem, which is exactly about one such trade). For now, just hold the goals loosely in mind.

1.7 The mental picture: nodes talking over a network

Here is the simplest possible diagram of a distributed system. Several nodes, each its own computer, connected by a network, with clients talking to them. Burn this picture into your mind — almost every diagram later in the guide is a richer version of it.

                     CLIENTS (browsers, phones, apps)
                  ┌──────────┬──────────┬──────────┐
                  │          │          │          │
                  ▼          ▼          ▼          ▼
            ┌───────────────────────────────────────────┐
            │              THE  NETWORK                  │
            │   (messages: may be slow / lost / reordered)│
            └───────────────────────────────────────────┘
                  ▲          ▲          ▲          ▲
                  │          │          │          │
              ┌───┴───┐  ┌───┴───┐  ┌───┴───┐  ┌───┴───┐
              │ Node 1│  │ Node 2│  │ Node 3│  │ Node 4│
              │(server│  │(server│  │(server│  │(server│
              │  +data)  │  +data)  │  +data)  │  +data)
              └───┬───┘  └───┬───┘  └───┬───┘  └───┬───┘
                  └──────────┴────┬─────┴──────────┘
                                  │
                      nodes also message EACH OTHER
                      to stay coordinated (gossip,
                      replicate data, elect a leader…)

Two things to notice. First, the clients see "one service," but it's really four nodes. Second, the nodes don't only talk to clients — they talk to each other across the same unreliable network, to copy data and stay in sync. That node-to-node chatter is where most of the hard, interesting problems live (and where the rest of this study guide spends its time).

Now here's a quick timeline diagram showing the unreliable-network reality — the thing that makes all of this hard. A client asks a node to save something; the network can betray us in several ways:

  CLIENT                     NETWORK                     NODE
    │                                                      │
    │  ──── "save X" ───────────────────────────────────▶ │  (1) works: arrives,
    │                                                      │      node saves, replies
    │  ◀──────────────────────── "ok, saved" ──────────── │
    │                                                      │
    │  ──── "save Y" ───────────────────────X             │  (2) lost on the way:
    │                (message dropped, no reply ever)      │      node never heard it
    │                                                      │
    │  ──── "save Z" ───────────────────────────────────▶ │  (3) node saved it,
    │                                                      │      but the REPLY is lost…
    │                  X ◀────────────────── "ok, saved"   │
    │   (client waits, times out — did it save? unknown!)  │
    │                                                      │
  In case (3) the client CANNOT TELL the difference from case (2).
  This uncertainty is the central challenge of distributed systems.
Key takeaway: A distributed system is many nodes cooperating over an unreliable network to look like one system. You build them for scalability, fault tolerance/availability, low latency, and data that's too big for one machine. The price you pay is uncertainty — you often cannot tell whether a remote action happened. Everything else in this guide is tools for living with that uncertainty.

Common mistakes & misconceptions

Common mistake: Thinking "distributed" just means "running on a server." A single web server with a single database is not a distributed system. It becomes one only when multiple cooperating machines must coordinate and present themselves as one. The defining feature is multiple nodes coordinating over a network, not merely "code that runs in the cloud."
Common mistake: Confusing latency and throughput. Latency is how long one request waits; throughput is how many requests finish per second. A system can have high throughput (handles tons of traffic) yet bad latency (each user still waits a long time), or the reverse. Always ask which one a metric is talking about.
Common mistake: Treating fault and failure as the same word. A fault is a broken part (one disk, one dropped packet); a failure is the whole system letting the user down. Good design absorbs faults so they never become failures. Saying "we had a fault" is very different from "we had a failure."
Common mistake: Mixing up the two meanings of "partition." In storage it means a shard — a slice of data on a node. In networking (which you'll meet in the CAP theorem) it means a network partition — when the network splits so some nodes can't reach others. Same word, completely different idea. Always check the context.
Common mistake: Assuming the network is fast and reliable like a function call. It is not. Messages get lost, delayed, duplicated, and reordered, and you often can't tell a dead node from a slow one. Treating remote calls as if they were local in-memory calls is the single most common beginner error — and the source of countless real outages.
Common mistake: Believing distribution is always better. It adds real complexity — coordination, partial failures, debugging across machines. If one machine comfortably does the job, one machine is the right answer. Distribute only when a concrete reason (scale, availability, latency, data size) forces you to.

Best practices

Best practice: Start with the simplest design that works — often a single machine — and distribute only when a measured need (too much load, not enough uptime, users too far away, data too big) actually appears. Complexity should be earned, not assumed.
Best practice: Always design assuming the network and other nodes will fail. Ask of every remote call: "What happens if I never get a reply? What if it ran twice?" Planning for failure from day one is what separates systems that survive from systems that surprise you.
Best practice: Be explicit about which goal you are optimizing — scalability, availability, low latency, or strong correctness. These goals trade off against each other, so name the one that matters most for your feature before you choose a design.
Best practice: Use precise vocabulary in design discussions. Say "node," "replica," "shard," "latency," "throughput," "fault," "failure" deliberately and correctly. Sloppy words lead to sloppy designs; shared, exact language is half of getting distributed systems right.
Best practice: Learn from battle-tested systems instead of inventing from scratch. The patterns in Cassandra, DynamoDB, etcd, Kafka, and DNS solve problems you will hit too — when in doubt, study how a proven system handled it before rolling your own.

Section summary

  • A distributed system is many separate computers cooperating over a network to appear as one coherent system to the user.
  • Moving from one computer to many removes three comforts: shared memory, a single clock, and all-or-nothing failure — and adds an unreliable network where messages can be lost, delayed, duplicated, or reordered.
  • You use them everywhere: Google, WhatsApp, online banking, DNS (13 named root addresses, ~1,900+ real servers worldwide), Git, and databases like Cassandra and DynamoDB (both descended from Amazon's 2007 Dynamo paper).
  • We build them for four reasons: scalability (more load), fault tolerance / high availability (survive failures), low latency (serve users nearby), and data too big for one machine.
  • Core vocabulary: node (one computer), cluster (group of nodes), server/client, replica (a data copy), partition/shard (a data slice), latency (one request's wait), throughput (requests per second), availability (uptime fraction).
  • A fault is a broken part; a failure is the whole system letting the user down — good design stops the first from becoming the second.
  • The design goals are transparency, scalability, reliability, availability, and fault tolerance — and they trade off against one another.
  • The mental picture: nodes talking to clients and to each other over an unreliable network; the deep difficulty is that you often cannot tell whether a remote action succeeded.
  • Distribution is a tool with a cost — use it only when a real need forces it, and always design assuming failure.

Continue reading