How a Computer Runs Your Program: CPU, Memory, Processes & Threads
You write code in a language like Python, Java, or JavaScript. But a computer's brain — the CPU (Central Processing Unit, the chip that does all the calculating) — does not understand any of those languages. It understands only machine code: long strings of binary numbers (1s and 0s) that are instructions for one specific kind of chip. This section follows your program all the way down: how your text becomes machine code, how the CPU runs it, where the data lives, and how the operating system juggles many programs at once. Understanding this layer is what separates a developer who guesses at performance from one who knows.
2.1 From Source Code to Something the CPU Can Run
Your source code (the text you type) must be translated into machine code. There are four main strategies, and real languages mix them.
1. Compilation (AOT — ahead-of-time). A compiler (a program that translates code) turns your whole program into machine code before it runs. Languages like C, C++, Rust, and Go work this way. The result is a standalone executable file that is fast and needs no translator at run time — but it is tied to one type of chip and operating system. Many errors are caught during compiling, before the program ever runs.
2. Interpretation. An interpreter reads your program and runs it one statement at a time, translating as it goes. This is flexible and portable (the same code runs anywhere the interpreter exists), but slower, because the code is re-analyzed every time it runs.
3. Bytecode + a virtual machine. Languages like Java and Python first compile to bytecode — a compact, portable intermediate code that is not machine code for any real chip. Bytecode runs on a virtual machine (VM), which is a piece of software that pretends to be a CPU. Java's javac tool produces .class bytecode that the JVM (Java Virtual Machine) executes. Python produces .pyc bytecode that CPython runs. This gives "write once, run anywhere."
4. JIT (just-in-time compilation). Modern VMs are clever. They start by interpreting bytecode, watch which functions run over and over (these are called "hot" paths), and then compile just those into native machine code while the program is still running. The JVM's HotSpot engine and every modern JavaScript engine (like Google's V8) do this. Because a JIT can see real data as the program runs, it can sometimes produce code faster than a plain ahead-of-time compiler.
C: hello.c --compiler+linker--> native executable --> CPU
Java: Hello.java --javac--> .class bytecode --> JVM (interprets, then JIT-compiles hot methods)
Python: hello.py --> .pyc bytecode --> CPython interprets
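To make the bytecode step concrete, here is a small sketch that uses Python's standard `dis` module to list the bytecode operations CPython generates for a function. The function `add` is a made-up example, and the exact opcode names vary between CPython versions, so the sketch prints them instead of assuming a fixed listing.

```python
import dis


def add(a, b):
    """Tiny example function whose bytecode we want to inspect."""
    return a + b


# dis.get_instructions yields one Instruction per bytecode operation.
opnames = [instr.opname for instr in dis.get_instructions(add)]

# Opcode names differ between CPython versions, so just print what we got.
print(opnames)

# The raw bytecode is a bytes object stored on the function's code object.
raw = add.__code__.co_code
print(type(raw).__name__, len(raw))
```

None of these operations are instructions for a real chip; they are instructions for the CPython virtual machine, which is the point of the bytecode strategy.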
2.2 The CPU and the Fetch–Decode–Execute Cycle
At its heart, a CPU does one thing in a loop, forever:
- Fetch — read the next instruction from memory. The CPU keeps the address of that instruction in a special slot called the program counter (PC), also called the instruction pointer.
- Decode — work out what the instruction means and what data it needs.
- Execute — the ALU (Arithmetic Logic Unit, the part that does math and logic) performs the operation; the result goes into a register or memory.
- Advance the PC to the next instruction (or jump elsewhere), then repeat.
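The loop above can be sketched as a toy interpreter. This is only an illustration: the instruction names (`LOAD`, `ADD`, `STORE`, `HALT`), the two registers, and the list used as memory are all hypothetical, and branches are left out to keep it short.

```python
def run(program, memory):
    """Run a toy program and return the final memory (hypothetical instruction set)."""
    registers = {"R0": 0, "R1": 0}
    pc = 0  # program counter: index of the next instruction

    while True:
        instruction = program[pc]      # fetch
        op, *args = instruction        # decode
        pc += 1                        # advance the PC by default

        if op == "LOAD":               # execute: register <- memory[address]
            reg, address = args
            registers[reg] = memory[address]
        elif op == "ADD":              # execute: dst <- dst + src
            dst, src = args
            registers[dst] += registers[src]
        elif op == "STORE":            # execute: memory[address] <- register
            reg, address = args
            memory[address] = registers[reg]
        elif op == "HALT":
            return memory
        else:
            raise ValueError(f"unknown instruction: {op}")


program = [
    ("LOAD", "R0", 0),    # R0 = memory[0]
    ("LOAD", "R1", 1),    # R1 = memory[1]
    ("ADD", "R0", "R1"),  # R0 = R0 + R1
    ("STORE", "R0", 2),   # memory[2] = R0
    ("HALT",),
]
result = run(program, [2, 3, 0])
print(result)  # [2, 3, 5]
```

Note that even this tiny addition follows the load, compute, store pattern: the values are copied from memory into registers before any arithmetic happens.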
This loop is driven by the clock, an electronic pulse that ticks billions of times per second. A 3 GHz CPU ticks 3 billion times per second, so one tick (cycle) takes about 0.3 nanoseconds. (A nanosecond is one-billionth of a second.)
Real CPUs are far cleverer than "one instruction at a time." Pipelining overlaps the fetch/decode/execute stages of several instructions, like an assembly line. Superscalar chips run several instructions in the same cycle. Out-of-order and speculative execution, helped by branch prediction (guessing which way an if will go), keep the assembly line full. But fetch-decode-execute is the correct mental model to start from.
Registers — the fastest storage that exists
Registers are a tiny set of storage slots inside the CPU itself. They are the fastest storage there is, typically readable within a single clock cycle. An x86-64 chip has about 16 general-purpose 64-bit registers (with names like RAX and RBX), plus the program counter, a stack pointer, and flag registers. The crucial fact: arithmetic and logic operate on values held in registers. To add two numbers, the CPU loads them from memory into registers, adds, and stores the result back. The CPU spends much of its life shuttling data between slow memory and these few fast registers.
2.3 The Memory Hierarchy — Why Locality Decides Speed
There is one unbreakable trade-off in hardware: fast memory is small and expensive; large memory is slow and cheap. So computers stack layers, each bigger and slower than the one above it.
| Level | Typical latency | Typical size |
|---|---|---|
| Registers (fastest, smallest) | <1 ns | ~hundreds of bytes |
| L1 cache | ~1 ns | ~32–64 KB per core |
| L2 cache | ~3–5 ns | 256 KB–2 MB per core |
| L3 cache | ~10–20 ns | tens of MB (shared) |
| Main memory (RAM) | ~100 ns | gigabytes |
| SSD | ~150 µs | hundreds of GB to TB |
| Hard disk (slowest, biggest) | ~10 ms | terabytes |
The exact numbers vary by machine; the ratios are what matter. RAM is roughly 100–200× slower than L1 cache. An SSD is about 1,000× slower than RAM. A spinning hard disk seek is about 100,000× slower than RAM.
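As a quick check of those ratios, the sketch below recomputes them from the approximate latencies listed above and also converts each latency into clock cycles on a 3 GHz CPU. The figures are the section's round numbers, not measurements of any particular machine.

```python
CLOCK_HZ = 3e9                # a 3 GHz CPU
cycle_ns = 1e9 / CLOCK_HZ     # duration of one cycle in nanoseconds (~0.33)

# Approximate latencies in nanoseconds, taken from the hierarchy above.
LATENCY_NS = {
    "L1 cache": 1,
    "Main memory": 100,
    "SSD": 150_000,           # ~150 microseconds
    "Hard disk": 10_000_000,  # ~10 milliseconds
}

ram_ns = LATENCY_NS["Main memory"]
ratios = {name: ns / ram_ns for name, ns in LATENCY_NS.items()}

for name, ns in LATENCY_NS.items():
    cycles = ns / cycle_ns    # how many clock ticks the CPU could have used
    print(f"{name:12s} {ns:>12,.0f} ns  ~{cycles:>12,.0f} cycles  {ratios[name]:>12,.2f}x RAM")
```

Seen in cycles, a single trip to main memory costs roughly 300 ticks at 3 GHz, which is why a cache miss is described as stalling the CPU.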
Caches (L1, L2, L3) are small, very fast memory (a type called SRAM) that hold copies of data the CPU recently used or will likely use. When the CPU needs data it checks L1, then L2, then L3, then RAM. Finding it early is a cache hit; not finding it is a cache miss, which stalls the CPU while it waits for a slower layer. L1 and L2 are usually per core (each core has its own); L3 is usually shared across all cores. Caches move data in fixed-size chunks called cache lines — typically 64 bytes — so touching one byte pulls in its 64-byte neighborhood.
Caches work because of locality, the single most important performance idea in this section:
- Temporal locality — data you used recently, you will probably use again soon. (So keep it cached.)
- Spatial locality — data near what you just used, you will probably use soon. (So pull in whole cache lines.)
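Spatial locality can be illustrated without timing anything. The sketch below walks an N×N array stored in row-major order in two ways and counts how often consecutive accesses land in a different 64-byte cache line. The element size, the matrix dimension, and the "line change" count are assumptions chosen for illustration; a real cache holds many lines and has prefetchers, so this is a rough proxy for misses, not a cache simulator.

```python
CACHE_LINE = 64   # bytes per cache line (the typical size quoted above)
ELEM_SIZE = 8     # assumed: 8-byte elements, e.g. 64-bit numbers
N = 256           # assumed matrix dimension


def line_changes(order):
    """Count how often consecutive accesses fall in a different cache line."""
    changes = 0
    last_line = None
    for i in range(N):
        for j in range(N):
            # "row" walks along a row; "col" walks down a column.
            row, col = (i, j) if order == "row" else (j, i)
            address = (row * N + col) * ELEM_SIZE  # row-major layout
            line = address // CACHE_LINE
            if line != last_line:
                changes += 1
                last_line = line
    return changes


row_changes = line_changes("row")
col_changes = line_changes("col")
print(row_changes, col_changes)  # 8192 65536
```

Both walks touch the same 65,536 elements, but the row walk uses all eight elements of each cache line before moving on, while the column walk jumps to a new line on every access.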
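For example, summing a large n×n array stored in row-major order does the same O(n×n) amount of work whether you walk it row by row or column by column, but the column version can be 5–10× slower purely from poor spatial locality.
2.4 Virtual Memory and Paging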
Each running program is given the illusion that it owns one huge, private, continuous block of memory — its virtual address space. In reality, physical RAM is shared among many programs and scattered into fragments. The operating system, together with a hardware unit on the CPU called the MMU (Memory Management Unit), translates the virtual address a program sees into the real physical address in RAM.
- Memory is divided into fixed-size pages on the virtual side and matching frames on the physical side, most commonly 4 KB each (with optional larger "huge pages" of 2 MB or 1 GB).
- Each process has a page table that maps each virtual page to a physical frame.
- Walking the page table on every access would be slow, so the MMU keeps a TLB (Translation Lookaside Buffer) — a small hardware cache of recent translations. A TLB hit is instant; a TLB miss forces a slower walk of the page table.
- If a page isn't in RAM at all, the MMU raises a page fault; the OS loads the page (maybe from disk swap space) and retries. When this happens constantly it's called thrashing, and performance collapses.
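The translation steps above can be sketched in a few lines. `ToyMMU`, `PageFault`, and the sample page table are hypothetical names for illustration only; a real MMU does this in hardware with multi-level page tables, and a real OS would handle the fault by loading the page and retrying.

```python
PAGE_SIZE = 4096  # 4 KB pages, as described above


class PageFault(Exception):
    """Raised when a virtual page has no physical frame (toy model)."""


class ToyMMU:
    def __init__(self, page_table):
        self.page_table = page_table  # virtual page number -> physical frame number
        self.tlb = {}                 # small cache of recent translations
        self.tlb_hits = 0
        self.tlb_misses = 0

    def translate(self, vaddr):
        # Split the virtual address into a page number and an offset within the page.
        vpn, offset = divmod(vaddr, PAGE_SIZE)
        if vpn in self.tlb:
            self.tlb_hits += 1
            frame = self.tlb[vpn]
        else:
            self.tlb_misses += 1
            if vpn not in self.page_table:
                raise PageFault(f"virtual page {vpn} is not resident")
            frame = self.page_table[vpn]  # the slower page-table walk
            self.tlb[vpn] = frame         # remember it for next time
        return frame * PAGE_SIZE + offset


mmu = ToyMMU({0: 7, 1: 3})
first = mmu.translate(0x1234)    # TLB miss, then page-table lookup
second = mmu.translate(0x1238)   # same page: TLB hit
print(hex(first), hex(second), mmu.tlb_hits, mmu.tlb_misses)  # 0x3234 0x3238 1 1
```

The offset within the page is never translated; only the page number is swapped for a frame number, which is why one TLB entry covers a whole 4 KB page.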
Virtual memory gives three big wins: isolation (one process literally cannot read or corrupt another's memory — a security and stability wall), the illusion of more memory than the physical RAM, and a clean, uniform address space so every program can be written as if it owns the machine.
2.5 Processes vs Threads
A process is a running program with its own isolated virtual address space. Two processes cannot see each other's memory by default; to communicate they must use explicit channels (pipes, network sockets, or shared-memory segments) — together called IPC (Inter-Process Communication). Each process has its own code, heap, global variables, open files, and at least one thread.
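A minimal sketch of two processes communicating over an explicit channel, using Python's standard `subprocess` module. The child is a second Python interpreter started from the parent; the variable name `secret` and the child's one-line program are invented for the example. The child's output reaches the parent through a pipe, which is one of the IPC channels mentioned above.

```python
import subprocess
import sys

secret = "parent-only value"  # lives in the parent's address space only

# The child is a separate process with its own memory: it has no `secret`.
child_code = "print('secret' in globals()); print(6 * 7)"

completed = subprocess.run(
    [sys.executable, "-c", child_code],
    capture_output=True,  # connect the child's stdout to a pipe we can read
    text=True,
    check=True,
)

lines = completed.stdout.split()
print(lines)  # ['False', '42']
```

The parent never reads the child's memory directly. It only sees the bytes the child chose to write into the pipe.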
A thread is a single line of execution inside a process. All threads in one process share the same memory — the same heap, globals, and code — but each thread has its own stack, own registers, and own program counter. That shared memory makes threads cheap to create and fast to communicate. It is also exactly what makes concurrency bugs possible.
PROCESS (one private address space)
+------------------------------------------+
| Code | Globals | Heap (SHARED)           |
|------------------------------------------|
| Thread 1: own stack + own registers      |
| Thread 2: own stack + own registers      |
+------------------------------------------+
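The sharing described above can be seen with Python's standard `threading` module. In this sketch (the names `shared_results` and `worker` are made up), each thread computes a value in a local variable, which lives in that thread's own stack frame, and then writes it into a single dictionary that both threads can see. Each thread writes a different key, so no coordination is needed here.

```python
import threading

shared_results = {}  # one object on the heap, visible to every thread


def worker(name, n):
    total = sum(range(n))         # local variable: private to this thread's stack frame
    shared_results[name] = total  # write into the shared heap object


threads = [
    threading.Thread(target=worker, args=(f"t{i}", 10 * (i + 1)))
    for i in range(2)
]
for t in threads:
    t.start()
for t in threads:
    t.join()  # wait until both threads have finished

print(shared_results)
```

No pipe or socket was needed to get the results back to the main thread, which is what makes thread communication fast.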
2.6 Stack vs Heap
Inside a process's address space, two regions grow toward each other.
high addresses
+------------------+
| Stack            |  grows DOWN v (function call frames)
|        |         |
|        v         |
|   (free gap)     |
|        ^         |
|        |         |
| Heap             |  grows UP ^ (dynamic objects)
+------------------+
| Globals          |
| Code (text)      |
+------------------+
low addresses
| Aspect | Stack | Heap |
|---|---|---|
| Stores | Function call frames: parameters, local variables, return address | Dynamically sized, longer-lived data (objects, big buffers) |
| Speed | Very fast — just move the stack pointer | Slower — an allocator must find free space |
| Freed by | Automatically, when the function returns | Manually (free) or by a garbage collector |
| Size | Small & fixed (often ~1–8 MB) | Large, flexible |
| Typical failure | Stack overflow (e.g. infinite recursion) | Memory leaks, use-after-free, fragmentation |
Each function call pushes a stack frame; returning pops it. A garbage collector (GC) is a runtime component (in Java, Python, JavaScript) that automatically frees heap memory that is no longer reachable.
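Both halves of the table can be observed from Python, with one caveat stated up front: CPython guards against runaway recursion with its own depth limit and raises `RecursionError`, so this sketch shows the interpreter's stand-in for a stack overflow, not a hardware-level one. The function names are invented for the example.

```python
import sys


def recurse_forever(depth=0):
    """Every call pushes another frame and never returns."""
    return recurse_forever(depth + 1)


def make_buffer():
    buf = [0] * 1000  # the list object is allocated on the heap
    return buf        # the frame is popped, but the heap object survives


try:
    recurse_forever()
    overflowed = False
except RecursionError:
    overflowed = True  # CPython's guard against exhausting the stack

buffer = make_buffer()
print(overflowed, len(buffer), sys.getrecursionlimit())
```

The list returned by `make_buffer` outlives the call that created it, and it is freed automatically once nothing refers to it any more.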
2.7 The Scheduler, Context Switches, and User vs Kernel Mode
One CPU core runs exactly one thread at any instant. The scheduler (part of the OS) rapidly rotates many threads onto the available cores, giving each a short time slice before switching to the next. Because it switches thousands of times a second, everything appears to run at once. This is preemptive multitasking — the OS can pause ("preempt") a thread without asking it.
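Time slicing can be sketched as a round-robin queue. Important caveat: this toy uses Python generators, which give up control voluntarily, so it models only the rotation policy. Real preemption is forced by the OS and hardware, not by the task. The names `task` and `round_robin` are hypothetical.

```python
from collections import deque


def task(name, steps):
    """A toy task: each yield stands for one unit of work."""
    for i in range(steps):
        yield f"{name}:{i}"


def round_robin(tasks, time_slice):
    """Give each task up to `time_slice` units, then move it to the back of the queue."""
    ready = deque(tasks)
    trace = []
    while ready:
        current = ready.popleft()
        for _ in range(time_slice):
            try:
                trace.append(next(current))
            except StopIteration:
                break                  # task finished: do not requeue it
        else:
            ready.append(current)      # slice used up: back of the queue
    return trace


trace = round_robin([task("A", 3), task("B", 2)], time_slice=2)
print(trace)  # ['A:0', 'A:1', 'B:0', 'B:1', 'A:2']
```

Task A does not finish before B gets a turn. With slices that are short enough, both appear to make progress at the same time.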
Swapping one thread for another is a context switch: the OS saves the current thread's registers and program counter, then loads another's. A switch between threads of the same process is comparatively cheap, because the memory map (and the TLB) stays the same. A switch between different processes is more expensive: the address space changes, so the TLB's cached translations are no longer valid and are typically flushed (some CPUs instead tag entries per address space), and the caches go cold. Context switches are pure overhead: useful work stops while they happen.
User mode vs kernel mode are hardware-enforced privilege levels. Your code runs in user mode, with no direct access to hardware or other processes' memory. The OS kernel (its trusted core) runs in kernel mode with full access. The only controlled doorway between them is a system call — a request like "open this file" or "send these bytes on this socket." A system call traps into the kernel, switches to kernel mode, does the privileged work, and returns. System calls are relatively expensive, which is why fast code minimizes them (by batching and buffering I/O).
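The effect of buffering can be shown by counting calls. In the sketch below, `CountingRaw` is a hypothetical stand-in for an OS file descriptor: each call to its `write` method represents one system call. Writing through Python's standard `io.BufferedWriter` collects many small writes in user-space memory and passes them down in far fewer calls. The chunk size and buffer size are arbitrary choices for the example.

```python
import io


class CountingRaw(io.RawIOBase):
    """Hypothetical raw stream: counts write() calls instead of doing real I/O."""

    def __init__(self):
        super().__init__()
        self.calls = 0
        self.received = bytearray()

    def writable(self):
        return True

    def write(self, data):
        self.calls += 1                # stands in for one system call
        self.received += bytes(data)
        return len(data)


CHUNK = b"12345678"
COUNT = 1000

# Unbuffered: every small write goes straight to the raw stream.
unbuffered = CountingRaw()
for _ in range(COUNT):
    unbuffered.write(CHUNK)

# Buffered: small writes accumulate in memory and are passed down in bulk.
raw = CountingRaw()
buffered = io.BufferedWriter(raw, buffer_size=64 * 1024)
for _ in range(COUNT):
    buffered.write(CHUNK)
buffered.flush()

print(unbuffered.calls, raw.calls)
```

The same bytes arrive either way; only the number of trips across the user/kernel boundary changes.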
2.8 Why This Is the Foundation for Concurrency
Two facts from this section combine into the central challenge of all concurrent programming. First, the scheduler interleaves threads unpredictably — you cannot know exactly when one thread pauses and another resumes. Second, threads in a process share memory. When two threads read-modify-write the same shared data without coordination, the unlucky interleaving produces a wrong result. This is a race condition, and it's why locks and atomic operations exist.
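Here is a sketch of both the problem and one fix. Because real races depend on timing and are hard to reproduce on demand, the first function replays one unlucky interleaving by hand in a single thread. The second part uses real threads with a `threading.Lock` so that each read-modify-write happens as one uninterrupted step. `lost_update`, `SafeCounter`, and `hammer` are made-up names.

```python
import threading


def lost_update():
    """Replay an unlucky interleaving of two increments, step by step."""
    counter = 0
    a = counter      # thread A reads 0
    b = counter      # thread B reads 0 before A has written back
    counter = a + 1  # A writes 1
    counter = b + 1  # B also writes 1: A's increment is lost
    return counter


class SafeCounter:
    def __init__(self):
        self.value = 0
        self._lock = threading.Lock()

    def increment(self):
        with self._lock:     # only one thread at a time may run this block
            self.value += 1


def hammer(counter, times):
    for _ in range(times):
        counter.increment()


safe = SafeCounter()
workers = [threading.Thread(target=hammer, args=(safe, 10_000)) for _ in range(4)]
for w in workers:
    w.start()
for w in workers:
    w.join()

print(lost_update(), safe.value)  # 1 40000
```

Two increments produced 1 in the replayed interleaving, while the locked counter reaches the full 40,000 regardless of how the scheduler interleaves the four threads.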
- All code becomes machine code; the difference between compiled, interpreted, bytecode+VM, and JIT is only when that translation happens — and modern JITs can be very fast.
- The CPU runs a fetch–decode–execute loop on data held in a few ultra-fast registers; real chips pipeline and run instructions out of order to stay busy.
- Memory is a hierarchy where ratios rule: RAM is ~100× slower than L1 cache, disk ~100,000× slower than RAM. Locality-friendly code beats locality-hostile code even at the same Big-O.
- Virtual memory gives every process an isolated, clean address space; the MMU + TLB translate virtual pages to physical frames, with page faults and swap as the slow fallback.
- A process is isolated memory; threads share that memory but keep their own stack and registers — fast to communicate, but the source of race conditions.
- The scheduler time-slices threads; context switches and system calls cost real time (process switches flush the TLB), so minimize them and match thread count to cores.