Concurrency & Parallelism: Doing Many Things at Once
Modern programs rarely do just one thing at a time. A web server answers thousands of users at once. Your phone plays music while downloading a file while checking for messages. This section explains how computers juggle many tasks, the bugs that show up when they do, and the tools that keep things correct. We start from the very basics and build up.
1. The Big Distinction: Concurrency vs. Parallelism
People use these two words as if they mean the same thing. They do not. Getting this clear early will make everything else easier.
Concurrency is about structure: organizing a program so that many tasks can make progress over overlapping time periods. "Make progress over overlapping time" means each task moves forward a little, then the next one does, then back to the first. They are all "in flight" at the same time, but at any single instant only one might actually be running. A computer with a single CPU core (a "core" is one processing unit that runs instructions) does this by time-slicing — switching between tasks so fast it looks simultaneous.
Parallelism is about execution: literally running multiple tasks at the exact same instant. This requires multiple cores. Two cores can each run a task in the same nanosecond.
Rob Pike, the designer of the Go programming language, put it memorably: "Concurrency is about dealing with lots of things at once. Parallelism is about doing lots of things at once." Concurrency is how you structure a program; parallelism is a property of the hardware running it.
2. Why Bother? It Depends on What's Slowing You Down
The right concurrency tool depends entirely on what your program spends its time doing. There are two flavors of work:
- I/O-bound work: "I/O" means input/output — reading a disk, calling a network service, querying a database. The bottleneck is waiting. While the program waits for a database to reply (say 100 milliseconds), the CPU sits idle doing nothing. Concurrency wins big here: during the wait, let another task use the CPU. This works even on a single core.
- CPU-bound work: The bottleneck is computation — hashing passwords, resizing images, crunching numbers. There is no waiting to hide; the CPU is busy the whole time. Concurrency alone gives you nothing here. To go faster you need true parallelism: more cores doing real work simultaneously.
| Aspect | I/O-bound | CPU-bound |
|---|---|---|
| Bottleneck | Waiting (disk, network, DB) | Calculation |
| CPU busy? | Mostly idle, waiting | Fully busy |
| Best tool | Async / event-loop or threads | Multiple processes / parallel threads |
| Helped by more cores? | Not much — hiding waits is enough | Yes — that's the whole point |
3. Three Ways to Do Many Things at Once
There are three common models. Define each carefully.
- Processes: A process is an independent running program with its own private memory. Two processes cannot accidentally corrupt each other's data because they are isolated. If one crashes, the others survive. The downside: they are heavyweight (slow to create) and talking between them needs IPC (inter-process communication — pipes, sockets, shared-memory segments), which is more work.
- Threads: A thread is an execution stream inside a process. Multiple threads in one process share the same memory. They are lightweight and fast to create, and they communicate just by reading/writing shared variables. But that shared memory is exactly what causes the bugs we'll see next. The operating system's scheduler (the part that decides which thread runs) can preempt — interrupt — a thread at almost any instruction.
- Async / event-loop: A single thread runs an event loop — a loop that picks up ready tasks and runs them a piece at a time. Tasks cooperatively give up control at marked pause points (often the keyword
await). Because a task is never interrupted mid-step, there are far fewer races. But one greedy task that never yields freezes everything.
How Node.js does it (a concrete real example)
Node.js (a JavaScript runtime) runs your JavaScript on a single thread with an event loop. Underneath sits a C library called libuv. For network I/O, libuv asks the operating system's native async facilities (epoll on Linux, kqueue on macOS, IOCP on Windows) to notify it when data is ready — no extra threads needed. But some operations can't be done async by the OS (file system access, crypto hashing, DNS lookups). For those, libuv uses a small thread pool of 4 threads by default (you can change it with the UV_THREADPOOL_SIZE setting). When a worker finishes, its callback is queued back onto the event loop. So "single-threaded Node" quietly has worker threads for the blocking jobs.
4. The Core Danger: Race Conditions
A race condition is when the correctness of your program depends on the timing of how threads interleave — and some interleavings produce wrong answers. The name comes from threads "racing" each other to touch shared data.
The classic example everyone must understand is the lost update. Consider counter++ (add one to a shared counter). It looks like one step, but the CPU actually does three: READ the value from memory, MODIFY it (add 1) in a register, and WRITE it back. With two threads and counter = 0:
Thread A Thread B counter in memory
READ (sees 0) 0
READ (sees 0) 0
ADD -> 1 0
ADD -> 1 0
WRITE 1 1
WRITE 1 1 <-- should be 2!
Thread B read the old value (0) before A wrote its result. One increment vanished. Run two threads each incrementing a shared counter 1,000,000 times and you will get a final number less than 2,000,000 — and a different number each run.
- The critical section is the code that touches shared data and must not be interleaved (here, the read-modify-write).
- Mutual exclusion is the guarantee that only one thread at a time may be inside the critical section.
counter++, i++, list.append, or a dictionary update is "atomic" (indivisible). It usually is not. Before teaching anyone the fix, reproduce the lost update — seeing <2,000,000 makes it real.5. The Toolbox: Synchronization Primitives
These are the low-level tools that keep shared state safe.
- Lock / Mutex ("mutual exclusion"): a thread calls
acquire()before the critical section andrelease()after. Only the holder proceeds; everyone else waits. Wrapcounter++in a lock and the result is always 2,000,000. - Semaphore: a counter that permits up to N threads at once. A mutex is just a semaphore with N=1. Use it to cap concurrency, e.g. "at most 10 simultaneous database connections."
- Atomic operations: hardware-guaranteed indivisible operations, such as compare-and-swap (CAS) or atomic fetch-add. An atomic increment can't be split apart, so it fixes the counter without a lock — faster for simple cases.
- Memory barriers / visibility: modern CPUs cache values in per-core caches and reorder instructions for speed. So even with no torn write, one thread's update may not be visible to another. A memory barrier (or the guarantees built into a language's memory model) forces the update to be published across cores. This is why Java has
volatile, C++ hasstd::atomic, and Go ships a race detector. Locks and atomics also act as barriers.
volatile, or barrier. Visibility is not free.6. The Three Classic Failures
Deadlock is when two or more threads each hold a resource the other needs, and nobody lets go — frozen forever. It requires all four Coffman conditions at once:
- Mutual Exclusion — a resource can't be shared.
- Hold and Wait — a thread holds one resource while waiting for another.
- No Preemption — resources are only given up voluntarily.
- Circular Wait — a cycle of waiting, P1 waits on P2 waits on … waits on P1.
Break any one and deadlock becomes impossible.
Thread A holds Lock1, wants Lock2
Thread B holds Lock2, wants Lock1
A --waits for--> Lock2 --held by--> B
^ |
| v
Lock1 <--held by-- A B --waits for--> Lock1
(a circle = circular wait = stuck)
FIX: everyone always grabs Lock1 BEFORE Lock2.
No circle can form. Deadlock gone.
Livelock: threads are not blocked — they are busy running and changing state — but make no real progress. Often caused by overly polite retry/back-off logic.
Starvation: a thread is perpetually denied a resource because others keep cutting ahead (e.g., a low-priority thread that never gets scheduled, or a writer blocked forever by a stream of readers). The fix is fairness policies — fair locks that take turns.
7. Amdahl's Law: The Hard Ceiling on Speedup
Adding more cores does not give unlimited speedup. Amdahl's Law tells you the maximum. Let p be the fraction of your program that can run in parallel, and s the number of processors:
Speedup = 1 / ( (1 − p) + p/s )
As s grows toward infinity, p/s shrinks to zero, so speedup approaches 1 / (1 − p). The serial (non-parallel) part sets a hard cap.
The optimistic counterpoint is Gustafson's Law: in the real world, when you have more cores you usually tackle a bigger problem, and the parallel fraction grows with problem size. So practical scaling can beat Amdahl's fixed-size pessimism.
8. Higher-Level Models (Where Modern Code Lives)
Hand-written locks are error-prone, so most modern systems prefer safer abstractions:
- Message passing / Actors: no shared memory at all. Each "actor" has private state and a mailbox, and they only communicate by sending immutable messages. An actor handles one message at a time, so its state needs no locks. (Erlang, Elixir, Akka.) This eliminates whole classes of races.
- CSP / Channels (Go): "Communicating Sequential Processes." Lightweight goroutines (tiny concurrent functions) pass values through channels instead of touching shared memory. Go's motto: "Don't communicate by sharing memory; share memory by communicating." A channel both moves data and synchronizes — a send waits until a receiver is ready.
- async / await event loops (Node, Python
asyncio): cooperative single-threaded concurrency, excellent for I/O-bound work.awaityields control during a wait so other tasks run. - Thread pools: reuse a fixed set of worker threads instead of spawning a new one per task. Tasks go on a queue; idle workers pick them up. Creating threads is expensive and unbounded threads exhaust memory.
- Producer-consumer + backpressure: the backbone of server work-dispatch.
[Producers] ---> [ bounded thread-safe queue ] ---> [Consumers] queue FULL => producers block (this is "backpressure") queue EMPTY => consumers block (wait for work)
Backpressure means the queue's fixed size automatically slows fast producers when consumers can't keep up — a built-in safety valve.
9. Thread-Safety & Immutability — the Cheapest Win
Code is thread-safe if it behaves correctly under concurrent access without extra synchronization. The cheapest way to get there is immutability: if data never changes after it is created, there is nothing to race on — no locks needed, ever. This is why functional and message-passing styles scale so well.
10. Python's GIL (a Real-World Wrinkle)
CPython (the standard Python) has historically had a Global Interpreter Lock (GIL) — a single lock that lets only one thread run Python bytecode at a time. So Python threads do not give CPU-bound parallelism; for that you use the multiprocessing module (separate processes, separate memory, no shared GIL). But threads do help I/O-bound work, because the GIL is released while a thread waits on I/O.
Current status (2025/2026): PEP 703 (the "free-threaded," no-GIL proposal) was accepted in 2023. Python 3.13 (October 2024) shipped an experimental free-threaded build. Python 3.14 (2025) moved free-threading to officially supported status (via PEP 779). It is still opt-in, not the default, carries a roughly 5–10% single-thread overhead, and default-on remains years away.
multiprocessing or the free-threaded build instead.11. Tying It Together: A Web Server With 10,000 Requests
A server receiving 10,000 simultaneous requests, each waiting about 100ms on a database, is I/O-bound (mostly waiting). Three approaches:
- Thread-per-request: spawn 10,000 threads. It works, but threads are heavyweight — that's a lot of memory and constant context-switching (the cost of the OS swapping threads on the CPU).
- Thread pool: a bounded set of workers pulls requests off a queue (producer-consumer). Stable memory, controlled concurrency.
- Async event loop: one or a few threads juggle all 10,000 using non-blocking I/O. This scales far better when the work is mostly waiting.
This is the famous C10k problem (handling 10,000 concurrent connections) and why async-first servers — nginx, Node.js, Go — dominate I/O-heavy workloads.
- Concurrency is structure (dealing with many tasks); parallelism is execution (running them at once). Concurrency enables but doesn't require parallelism.
- Match the tool to the bottleneck: async/event-loop or threads for I/O-bound (hide waiting), multiple processes/parallel threads for CPU-bound (need more cores).
- Shared mutable memory causes race conditions;
counter++is a read-modify-write and can lose updates. Protect critical sections with locks, atomics, or higher-level models — and immutability avoids the problem entirely. - Deadlock needs all four Coffman conditions; break one (lock ordering breaks circular wait). Watch also for livelock (busy, no progress) and starvation (perpetually denied).
- Amdahl's Law caps speedup at 1/(1−p): a 90%-parallel program tops out at 10× no matter how many cores. Profile and shrink the serial part first.
- Prefer channels/actors/queues over hand-rolled locks; bound your thread pools; enable race detectors in CI; remember Python's GIL serializes CPU-bound threads (free-threading is supported but not default as of 3.14).