How a Computer Runs Your Program: CPU, Memory, Processes & Threads

By Pritesh Yadav

You write code in a language like Python, Java, or JavaScript. But a computer's brain — the CPU (Central Processing Unit, the chip that does all the calculating) — does not understand any of those languages. It understands only machine code: long strings of binary numbers (1s and 0s) that are instructions for one specific kind of chip. This section follows your program all the way down: how your text becomes machine code, how the CPU runs it, where the data lives, and how the operating system juggles many programs at once. Understanding this layer is what separates a developer who guesses at performance from one who knows.

2.1 From Source Code to Something the CPU Can Run

Your source code (the text you type) must be translated into machine code. There are four main strategies, and real languages mix them.

1. Compilation (AOT — ahead-of-time). A compiler (a program that translates code) turns your whole program into machine code before it runs. Languages like C, C++, Rust, and Go work this way. The result is a standalone executable file that is fast and needs no translator at run time — but it is tied to one type of chip and operating system. Many errors are caught during compiling, before the program ever runs.

2. Interpretation. An interpreter reads your program and runs it one statement at a time, translating as it goes. This is flexible and portable (the same code runs anywhere the interpreter exists), but slower, because the code is re-analyzed every time it runs.

3. Bytecode + a virtual machine. Languages like Java and Python first compile to bytecode — a compact, portable intermediate code that is not machine code for any real chip. Bytecode runs on a virtual machine (VM), which is a piece of software that pretends to be a CPU. Java's javac tool produces .class bytecode that the JVM (Java Virtual Machine) executes. Python produces .pyc bytecode that CPython runs. This gives "write once, run anywhere."

4. JIT (just-in-time compilation). Modern VMs are clever. They start by interpreting bytecode, watch which functions run over and over (these are called "hot" paths), and then compile just those into native machine code while the program is still running. The JVM's HotSpot engine and every modern JavaScript engine (like Google's V8) do this. Because a JIT can observe real data and real branch behavior as the program runs, it can sometimes generate faster machine code than an ahead-of-time compiler, which must optimize without that information.
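To make the bytecode step (strategy 3) concrete, here is a minimal sketch that asks CPython to show the bytecode it compiled for a small function. The function name `add` is only an illustration, and the exact instruction names differ between Python versions, so treat the printed list as indicative rather than fixed.

```python
import dis


def add(a, b):
    """Tiny example function; CPython compiles it to bytecode when it is defined."""
    return a + b


# The compiled bytecode is stored on the function's code object as raw bytes.
raw_bytecode = add.__code__.co_code

# dis decodes those bytes into named VM instructions (names vary by version).
opnames = [instr.opname for instr in dis.get_instructions(add)]

print(len(raw_bytecode), "bytes of bytecode")
print(opnames)
```

None of these instructions is machine code for a real chip; the CPython virtual machine reads them one by one and carries each out.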

C:        hello.c  --compiler+linker-->  native executable  --> CPU
Java:     Hello.java --javac--> .class bytecode --> JVM
                          (interprets, then JIT-compiles hot methods)
Python:   hello.py  --> .pyc bytecode  --> CPython interprets

Key takeaway: Everything ends up as machine code the CPU executes. The only question is when the translation happens: before running (AOT), during running (JIT), or piece-by-piece every time (interpretation).

Common mistake: Believing "compiled = fast, interpreted = slow" as an absolute rule. Modern JIT engines (V8, JVM HotSpot) compile hot code to native machine code at run time and can rival or beat naive compiled code. CPython is slow mainly because of dynamic typing (types are checked on every operation at run time) and the interpreter overhead paid on each bytecode instruction. Its global interpreter lock (GIL) is a separate issue: it keeps threads from running Python code in parallel. "It's interpreted" alone does not explain the gap.

2.2 The CPU and the Fetch–Decode–Execute Cycle

At its heart, a CPU does one thing in a loop, forever:

  1. Fetch — read the next instruction from memory. The CPU keeps the address of that instruction in a special slot called the program counter (PC), also called the instruction pointer.
  2. Decode — work out what the instruction means and what data it needs.
  3. Execute — the ALU (Arithmetic Logic Unit, the part that does math and logic) performs the operation; the result goes into a register or memory.
  4. Advance the PC to the next instruction (or jump elsewhere), then repeat.
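The four steps above can be sketched as a toy interpreter loop. This is a made-up machine, not a real instruction set: the instruction names (`LOADI`, `ADD`, `ADDI`, `JNZ`, `HALT`) and the two registers `A` and `B` are assumptions chosen only to show the shape of the loop.

```python
def run(program):
    """Run a toy program and return the final register values."""
    regs = {"A": 0, "B": 0}
    pc = 0  # program counter: index of the next instruction
    while True:
        instr = program[pc]        # 1. fetch the instruction the PC points at
        op, *args = instr          # 2. decode it into an operation and operands
        pc += 1                    # 4. advance the PC (a jump may overwrite it)
        if op == "LOADI":          # 3. execute
            regs[args[0]] = args[1]
        elif op == "ADD":
            regs[args[0]] += regs[args[1]]
        elif op == "ADDI":
            regs[args[0]] += args[1]
        elif op == "JNZ":          # jump if the register is not zero
            if regs[args[0]] != 0:
                pc = args[1]
        elif op == "HALT":
            return regs
        else:
            raise ValueError(f"unknown instruction: {op}")


# Sum 5 + 4 + 3 + 2 + 1 using a loop built from a conditional jump.
program = [
    ("LOADI", "A", 0),   # 0: A = 0 (running total)
    ("LOADI", "B", 5),   # 1: B = 5 (counter)
    ("ADD", "A", "B"),   # 2: A += B
    ("ADDI", "B", -1),   # 3: B -= 1
    ("JNZ", "B", 2),     # 4: if B != 0, jump back to instruction 2
    ("HALT",),           # 5: stop
]

final_regs = run(program)
print(final_regs)  # {'A': 15, 'B': 0}
```

A real CPU does the same thing in hardware, with the program stored in memory as binary numbers rather than Python tuples.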

This loop is driven by the clock, an electronic pulse that ticks billions of times per second. A 3 GHz CPU ticks 3 billion times per second, so one tick (cycle) takes about 0.3 nanoseconds. (A nanosecond is one-billionth of a second.)

Real CPUs are far cleverer than "one instruction at a time." Pipelining overlaps the fetch/decode/execute stages of several instructions, like an assembly line. Superscalar chips run several instructions in the same cycle. Out-of-order and speculative execution, helped by branch prediction (guessing which way an if will go), keep the assembly line full. But fetch-decode-execute is the correct mental model to start from.

Registers — the fastest storage that exists

Registers are a tiny set of storage slots inside the CPU itself. They are the fastest memory there is, available within a single clock cycle. An x86-64 chip has 16 general-purpose 64-bit registers (with names like RAX and RBX), plus the program counter and a flags register; the stack pointer is one of the sixteen. The crucial fact: the ALU works on values held in registers. To add two numbers, the CPU loads them from memory into registers, adds, and stores the result back. The CPU spends much of its life shuttling data between slow memory and these few fast registers.

2.3 The Memory Hierarchy — Why Locality Decides Speed

There is one unbreakable trade-off in hardware: fast memory is small and expensive; large memory is slow and cheap. So computers stack layers, each bigger and slower than the one above it.

        /\          Registers   <1 ns     ~hundreds of bytes
       /  \         L1 cache     ~1 ns     ~32-64 KB per core
      / fast\       L2 cache     ~3-5 ns   256 KB-2 MB per core
     /  small\      L3 cache     ~10-20 ns tens of MB (shared)
    /----------\    Main memory  ~100 ns   gigabytes (RAM)
   /  slow big  \   SSD          ~150 us   hundreds of GB-TB
  /--------------\  Hard disk    ~10 ms    terabytes

The exact numbers vary by machine; the ratios are what matter. RAM is roughly 100–200× slower than L1 cache. An SSD is about 1,000× slower than RAM. A spinning hard disk seek is about 100,000× slower than RAM.

Analogy: Stretch every time so that one cycle of a 3 GHz CPU (about 0.3 ns) lasts 1 second, a scale factor of roughly 3 billion. Reaching L1 cache then takes about 3 seconds. Reaching RAM: about 5 minutes. Reading from an SSD: about 5 days. A hard-disk seek: about a year. Suddenly "just read it from disk" sounds as crazy as it really is to the CPU.
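The scaling behind this kind of analogy is plain multiplication, and it is easy to check. The sketch below assumes a 3 GHz clock and stretches one clock cycle to one second, using the rough latencies from the pyramid above; the helper name `humanize` is an illustrative choice.

```python
CLOCK_HZ = 3e9      # assumed 3 GHz clock, so one cycle is 1 / 3e9 seconds
SCALE = CLOCK_HZ    # multiplying by this stretches one cycle to one second

# Rough latencies in seconds, taken from the hierarchy table above.
latencies_s = {
    "L1 cache": 1e-9,
    "Main memory": 100e-9,
    "SSD": 150e-6,
    "Hard disk": 10e-3,
}


def humanize(seconds):
    """Format a duration using the largest unit that fits."""
    for unit, size in (("years", 365 * 86400), ("days", 86400), ("minutes", 60)):
        if seconds >= size:
            return f"{seconds / size:.1f} {unit}"
    return f"{seconds:.1f} seconds"


scaled = {name: t * SCALE for name, t in latencies_s.items()}
for name, t in scaled.items():
    print(f"{name:12s} -> {humanize(t)}")
```

With these inputs, RAM lands at about 5 minutes and an SSD read at about 5 days, while a hard-disk seek comes out just under a year.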

Caches (L1, L2, L3) are small, very fast memory (a type called SRAM) that hold copies of data the CPU recently used or will likely use. When the CPU needs data it checks L1, then L2, then L3, then RAM. Finding it early is a cache hit; not finding it is a cache miss, which stalls the CPU while it waits for a slower layer. L1 and L2 are usually per core (each core has its own); L3 is usually shared across all cores. Caches move data in fixed-size chunks called cache lines — typically 64 bytes — so touching one byte pulls in its 64-byte neighborhood.

Caches work because of locality, the single most important performance idea in this section:

  • Temporal locality — data you used recently, you will probably use again soon. (So keep it cached.)
  • Spatial locality — data near what you just used, you will probably use soon. (So pull in whole cache lines.)

Example: Summing every number in a big 2D grid that is stored row by row in memory (row-major order, as in C). If you walk it row by row, you visit memory in order, so each 64-byte cache line is fully used before moving on. Fast. If you walk it column by column, you jump far between each access, wasting most of every cache line you load. Same O(n×n) amount of work, but the column version can be 5–10× slower purely from poor spatial locality.
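The row-versus-column effect can be reproduced without timing anything by counting cache misses in a toy model. The sketch below assumes a row-major grid of 8-byte numbers, 64-byte cache lines, and a tiny fully associative LRU cache holding 16 lines. These sizes and the function names are illustrative choices, not a model of any real CPU.

```python
from collections import OrderedDict

LINE_BYTES = 64   # cache line size
ELEM_BYTES = 8    # one 64-bit number


def count_misses(addresses, cache_lines=16):
    """Count misses for a sequence of byte addresses in a small LRU cache."""
    cache = OrderedDict()  # keys are cache-line numbers, ordered by recency
    misses = 0
    for addr in addresses:
        line = addr // LINE_BYTES
        if line in cache:
            cache.move_to_end(line)        # hit: mark as most recently used
        else:
            misses += 1                    # miss: load the whole line
            cache[line] = True
            if len(cache) > cache_lines:
                cache.popitem(last=False)  # evict the least recently used line
    return misses


def row_major_walk(rows, cols):
    """Addresses visited when walking a row-major grid row by row."""
    return [(r * cols + c) * ELEM_BYTES for r in range(rows) for c in range(cols)]


def column_major_walk(rows, cols):
    """Addresses visited when walking the same grid column by column."""
    return [(r * cols + c) * ELEM_BYTES for c in range(cols) for r in range(rows)]


row_misses = count_misses(row_major_walk(64, 64))
col_misses = count_misses(column_major_walk(64, 64))
print(row_misses, col_misses)  # 512 4096
```

In this model the row walk misses once per cache line (one miss, then seven hits), while the column walk misses on every single access because each line is evicted before its neighbors are needed. Real hardware has larger caches and prefetchers, so the measured gap depends on the grid size.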
Common mistake: Treating RAM as "fast" and all memory access as equal cost. In textbook algorithm analysis memory is uniform-cost, but on real hardware a cache miss to RAM can dominate the runtime. Two programs with identical Big-O can differ 10× from locality alone. Cache-friendly data layout (arrays beat pointer-chasing linked lists) is a real, measurable win.

2.4 Virtual Memory and Paging

Each running program is given the illusion that it owns one huge, private, continuous block of memory — its virtual address space. In reality, physical RAM is shared among many programs and scattered into fragments. The operating system, together with a hardware unit on the CPU called the MMU (Memory Management Unit), translates the virtual address a program sees into the real physical address in RAM.

  • Memory is divided into fixed-size pages on the virtual side and matching frames on the physical side, typically 4 KB each (some systems default to 16 KB, and most offer optional larger "huge pages" such as 2 MB or 1 GB).
  • Each process has a page table that maps each virtual page to a physical frame.
  • Walking the page table on every access would be slow, so the MMU keeps a TLB (Translation Lookaside Buffer) — a small hardware cache of recent translations. A TLB hit is instant; a TLB miss forces a slower walk of the page table.
  • If a page isn't in RAM at all, the MMU raises a page fault; the OS loads the page (maybe from disk swap space) and retries. When this happens constantly it's called thrashing, and performance collapses.
Example (address translation): Take a 32-bit virtual address with 4 KB pages. Since 4 KB = 2¹² bytes, the bottom 12 bits are the offset inside the page, and the top 20 bits are the page number. The MMU looks up the 20-bit page number in the TLB/page table to find a physical frame, then glues the unchanged 12-bit offset onto it. The offset never changes — only the page-number part gets remapped.
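The split-and-glue arithmetic in this example can be written out in a few lines. This is a minimal sketch that models a page table as a plain dictionary; the mapping from page `0x12345` to frame `0xABC` is invented for illustration, and a real MMU does this in hardware with multi-level tables.

```python
PAGE_BITS = 12                   # 4 KB pages: 2**12 bytes
PAGE_SIZE = 1 << PAGE_BITS
OFFSET_MASK = PAGE_SIZE - 1      # low 12 bits set


def translate(vaddr, page_table):
    """Translate a 32-bit virtual address using a dict of page -> frame."""
    page_number = vaddr >> PAGE_BITS    # top 20 bits
    offset = vaddr & OFFSET_MASK        # bottom 12 bits, passed through unchanged
    if page_number not in page_table:
        # A real system would raise a page fault and let the OS handle it.
        raise LookupError(f"page fault: page {page_number:#x} is not mapped")
    frame = page_table[page_number]
    return (frame << PAGE_BITS) | offset


page_table = {0x12345: 0xABC}    # hypothetical mapping for this example
physical = translate(0x12345678, page_table)
print(hex(physical))  # 0xabc678
```

The offset `0x678` appears unchanged in the result; only the page number `0x12345` was swapped for the frame number `0xABC`.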

Virtual memory gives three big wins: isolation (one process literally cannot read or corrupt another's memory — a security and stability wall), the illusion of more memory than the physical RAM, and a clean, uniform address space so every program can be written as if it owns the machine.

Common mistake: Confusing a virtual address with a physical one. The pointer value your program prints is virtual. The same virtual address in two different processes points to different physical RAM. Never assume two processes share memory just because their addresses look the same.

2.5 Processes vs Threads

A process is a running program with its own isolated virtual address space. Two processes cannot see each other's memory by default; to communicate they must use explicit channels (pipes, network sockets, or shared-memory segments) — together called IPC (Inter-Process Communication). Each process has its own code, heap, global variables, open files, and at least one thread.
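As a small illustration of IPC through pipes, the sketch below starts a second Python process and talks to it over its standard input and output. The child program (which just upper-cases what it reads) is a made-up example; the parent never touches the child's memory, it only sends and receives bytes.

```python
import subprocess
import sys

# Source code for the child process: read everything from stdin, reply in upper case.
child_code = "import sys; sys.stdout.write(sys.stdin.read().upper())"

result = subprocess.run(
    [sys.executable, "-c", child_code],  # launch a separate Python process
    input="hello from the parent\n",     # sent through a pipe to the child's stdin
    capture_output=True,                 # the child's stdout comes back through a pipe
    text=True,
    check=True,
)

reply = result.stdout
print(reply, end="")
```

The two processes have separate address spaces, so the only way the text crosses over is by being copied through the pipes the operating system set up.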

A thread is a single line of execution inside a process. All threads in one process share the same memory — the same heap, globals, and code — but each thread has its own stack, own registers, and own program counter. That shared memory makes threads cheap to create and fast to communicate. It is also exactly what makes concurrency bugs possible.

PROCESS  (one private address space)
+------------------------------------------+
|  Code  |  Globals  |  Heap   (SHARED)     |
|------------------------------------------|
|  Thread 1:  own stack + own registers    |
|  Thread 2:  own stack + own registers    |
+------------------------------------------+

Analogy: A process is a private kitchen with its own pantry no other kitchen can touch. The threads are several cooks sharing that one kitchen and pantry (shared memory), but each cook has their own cutting board (own stack). The head chef rotating cooks on and off the single stove is the scheduler doing context switches.

Common mistake: Assuming threads in one process have separate memory. They share the heap, globals, and code; that is the whole point and the whole danger. Having two threads write the same variable without coordination causes bugs. Best practice: protect shared changing data with locks or atomic operations, or avoid sharing it at all.
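The shared-versus-private split is easy to see with Python's `threading` module. In this minimal sketch, the names `shared_results` and `worker` are illustrative. Every thread appends to the same list object, while each thread's `local_total` is a local variable that belongs to that thread's own call frame.

```python
import threading

shared_results = []        # one object, visible to every thread in the process
lock = threading.Lock()    # guards the shared list


def worker(name, n):
    local_total = 0        # local variable: private to this thread's call frame
    for i in range(n):
        local_total += i
    with lock:             # coordinate before touching shared data
        shared_results.append((name, local_total))


threads = [
    threading.Thread(target=worker, args=(f"t{i}", 10 * (i + 1)))
    for i in range(3)
]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(sorted(shared_results))  # [('t0', 45), ('t1', 190), ('t2', 435)]
```

No data had to be copied or sent anywhere for the main thread to see the results, which is the convenience of shared memory. The lock is there because that same convenience is what allows two threads to interfere with each other.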

2.6 Stack vs Heap

Inside a process's address space, two regions grow toward each other.

 high addresses
 +------------------+
 |   Stack          |  grows DOWN  v   (function call frames)
 |        |         |
 |        v         |
 |   (free gap)     |
 |        ^         |
 |        |         |
 |   Heap           |  grows UP    ^   (dynamic objects)
 +------------------+
 |   Globals        |
 |   Code (text)    |
 +------------------+
 low addresses

Aspect          | Stack                                                             | Heap
----------------+-------------------------------------------------------------------+------------------------------------------------------------
Stores          | Function call frames: parameters, local variables, return address | Dynamically sized, longer-lived data (objects, big buffers)
Speed           | Very fast: just move the stack pointer                            | Slower: an allocator must find free space
Freed by        | Automatically, when the function returns                          | Manually (free) or by a garbage collector
Size            | Small & fixed (often ~1–8 MB)                                     | Large, flexible
Typical failure | Stack overflow (e.g. infinite recursion)                          | Memory leaks, use-after-free, fragmentation

Each function call pushes a stack frame; returning pops it. A garbage collector (GC) is a background helper (in Java, Python, JavaScript) that automatically frees heap memory no longer used.
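Frame pushing and popping can be observed from Python, with one caveat: CPython enforces a recursion limit and raises `RecursionError` instead of letting the native stack actually overflow. The sketch below uses that limit as a stand-in for a stack overflow; the function names are illustrative.

```python
import sys


def count_recursive(n):
    """Each call pushes a new frame; nothing is popped until n reaches 0."""
    if n == 0:
        return 0
    return 1 + count_recursive(n - 1)


def count_iterative(n):
    """Same result with a loop: one frame, no matter how large n is."""
    count = 0
    while n > 0:
        count += 1
        n -= 1
    return count


deep = sys.getrecursionlimit() * 10   # far more frames than CPython allows

try:
    count_recursive(deep)
    overflowed = False
except RecursionError:
    overflowed = True                 # the call stack ran out of room

print(overflowed, count_iterative(deep) == deep)  # True True
```

The loop version handles the same input because it reuses a single frame instead of stacking up one frame per step.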

Common mistake: Putting huge or unknown-size data on the stack, which can cause a stack overflow (the small stack runs out of room). The opposite mistake is assuming heap allocation is free: it is slower than stack allocation and can fragment memory. Prefer the stack for small, short-lived, fixed-size data.

2.7 The Scheduler, Context Switches, and User vs Kernel Mode

One CPU core runs one thread at any instant (setting aside simultaneous multithreading, where a single core presents itself as two logical cores). The scheduler (part of the OS) rapidly rotates many threads onto the available cores, giving each a short time slice before switching to the next. Because it switches thousands of times a second, everything appears to run at once. This is preemptive multitasking: the OS can pause ("preempt") a thread without asking it.

Swapping one thread for another is a context switch: the CPU saves the current thread's registers and program counter, then loads another's. A switch between threads of the same process is comparatively cheap, because the memory map (and the TLB contents) stays valid. A switch between different processes is more expensive: the address space changes, so the cached TLB translations no longer apply (the TLB is flushed, or on CPUs that tag entries per address space, the incoming process finds few useful entries) and the data caches go cold. Context switches are pure overhead: useful work stops while they happen.
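Time slicing can be sketched as a round-robin simulation. This is a deliberately simplified model, not how any real OS scheduler works: the task names, the work amounts, and the fixed quantum are all assumptions, and real schedulers also weigh priorities, I/O waits, and multiple cores.

```python
from collections import deque


def round_robin(work, quantum):
    """Simulate time slicing on one core.

    `work` maps a task name to its remaining work units. Returns the timeline
    as a list of (task, units_run) slices in the order they ran.
    """
    queue = deque(work.items())
    timeline = []
    while queue:
        name, remaining = queue.popleft()
        ran = min(quantum, remaining)        # run for at most one time slice
        timeline.append((name, ran))
        if remaining > ran:
            # Preempted: unfinished work goes to the back of the queue.
            queue.append((name, remaining - ran))
    return timeline


timeline = round_robin({"A": 5, "B": 2, "C": 3}, quantum=2)

# Every change of task between consecutive slices is one context switch.
switches = sum(1 for prev, cur in zip(timeline, timeline[1:]) if prev[0] != cur[0])

print(timeline)  # [('A', 2), ('B', 2), ('C', 2), ('A', 2), ('C', 1), ('A', 1)]
print(switches)  # 5
```

Ten units of useful work needed five switches here. A smaller quantum makes tasks feel more responsive but increases the number of switches, and each switch is overhead.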

User mode vs kernel mode are hardware-enforced privilege levels. Your code runs in user mode, with no direct access to hardware or other processes' memory. The OS kernel (its trusted core) runs in kernel mode with full access. The only controlled doorway between them is a system call — a request like "open this file" or "send these bytes on this socket." A system call traps into the kernel, switches to kernel mode, does the privileged work, and returns. System calls are relatively expensive, which is why fast code minimizes them (by batching and buffering I/O).

Best practice: Treat context switches and system calls as costing real time. Don't make thousands of tiny syscalls when one batched call works. Right-size thread pools to the workload — more threads is not automatically faster.
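The effect of buffering on the number of calls can be shown with a stand-in for a file descriptor. In the sketch below, `CountingSink` is a hypothetical class that counts how many times its `write` method is called; each call represents what would be one write system call on a real unbuffered file. The 4,096-byte buffer size is an arbitrary choice.

```python
import io


class CountingSink(io.RawIOBase):
    """Pretend raw file: records how many write() calls it receives."""

    def __init__(self):
        super().__init__()
        self.calls = 0
        self.total_bytes = 0

    def writable(self):
        return True

    def write(self, data):
        self.calls += 1
        self.total_bytes += len(data)
        return len(data)   # report that every byte was accepted


# Unbuffered: 10,000 one-byte writes reach the sink as 10,000 separate calls.
direct = CountingSink()
for _ in range(10_000):
    direct.write(b"x")

# Buffered: the same writes are collected in memory and passed on in big chunks.
sink = CountingSink()
buffered = io.BufferedWriter(sink, buffer_size=4096)
for _ in range(10_000):
    buffered.write(b"x")
buffered.flush()

print(direct.calls, sink.calls)
```

Both sinks receive the same 10,000 bytes, but the buffered one is called only a handful of times. This is the same idea that makes buffered file and socket I/O cheaper than many tiny writes.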

2.8 Why This Is the Foundation for Concurrency

Two facts from this section combine into the central challenge of all concurrent programming. First, the scheduler interleaves threads unpredictably: you cannot know exactly when one thread pauses and another resumes. Second, threads in a process share memory. When two threads read-modify-write the same shared data without coordination, an unlucky interleaving can produce a wrong result. This is a race condition, and it's why locks and atomic operations exist.
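Real races are timing-dependent and hard to reproduce on demand, so the sketch below does two things. First it spells out one unlucky interleaving by hand, with two imaginary threads each trying to add 1 to a counter. Then it runs real threads that use a lock so the read-modify-write happens as one indivisible step. The variable names are illustrative.

```python
import threading

# Part 1: one unlucky interleaving, written out step by step.
counter = 0
t1_seen = counter          # thread 1 reads 0
t2_seen = counter          # thread 2 reads 0 before thread 1 has written
counter = t1_seen + 1      # thread 1 writes 1
counter = t2_seen + 1      # thread 2 also writes 1, so one update is lost
lost_update_result = counter

# Part 2: real threads, with a lock around the read-modify-write.
state = {"total": 0}
lock = threading.Lock()


def add_many(n):
    for _ in range(n):
        with lock:                 # only one thread at a time gets past here
            state["total"] += 1


threads = [threading.Thread(target=add_many, args=(10_000,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(lost_update_result)  # 1, although two increments were attempted
print(state["total"])      # 40000
```

Without the lock, the second part may still print 40000 on many runs, which is exactly what makes race conditions dangerous: the bug only appears when the scheduler happens to interleave the threads at the wrong moment.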

Common mistake: Confusing concurrency with parallelism, and assuming "more threads = more speed." Concurrency is interleaving many tasks (even on one core via the scheduler). Parallelism is running them truly at the same time on multiple cores. For CPU-bound work, adding threads beyond the core count does not add speed, and with heavy lock contention or constant context switching it makes things slower.

Common mistake (false sharing): Two threads that update different variables which happen to sit on the same 64-byte cache line will fight over that line, silently destroying multi-thread performance even though they never logically share data. Awareness of cache lines prevents this.
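Whether two variables can collide this way comes down to layout arithmetic: do their byte offsets fall within the same 64-byte block? The sketch below uses `ctypes` structures only to compute field offsets; the struct names and the 56-byte padding are illustrative assumptions. It shows the layout idea, not the slowdown itself, and it assumes the struct starts on a cache-line boundary.

```python
import ctypes

CACHE_LINE = 64  # assumed cache line size in bytes


class Packed(ctypes.Structure):
    """Two counters side by side: 16 bytes total, so both share one cache line."""
    _fields_ = [("a", ctypes.c_int64), ("b", ctypes.c_int64)]


class Padded(ctypes.Structure):
    """Same counters with 56 bytes of padding, pushing b onto the next line."""
    _fields_ = [
        ("a", ctypes.c_int64),
        ("_pad", ctypes.c_char * 56),
        ("b", ctypes.c_int64),
    ]


def line_of(offset):
    """Cache-line index of a byte offset, relative to the start of the struct."""
    return offset // CACHE_LINE


packed_shares_line = line_of(Packed.a.offset) == line_of(Packed.b.offset)
padded_shares_line = line_of(Padded.a.offset) == line_of(Padded.b.offset)

print(packed_shares_line, padded_shares_line)  # True False
```

Padding trades memory for independence: each counter gets a line to itself, so a write by one thread no longer invalidates the line the other thread is using.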
Key takeaways:
  • All code becomes machine code; the difference between compiled, interpreted, bytecode+VM, and JIT is only when that translation happens — and modern JITs can be very fast.
  • The CPU runs a fetch–decode–execute loop on data held in a few ultra-fast registers; real chips pipeline and run instructions out of order to stay busy.
  • Memory is a hierarchy where ratios rule: RAM is ~100× slower than L1 cache, disk ~100,000× slower than RAM. Locality-friendly code beats locality-hostile code even at the same Big-O.
  • Virtual memory gives every process an isolated, clean address space; the MMU + TLB translate virtual pages to physical frames, with page faults and swap as the slow fallback.
  • A process is isolated memory; threads share that memory but keep their own stack and registers — fast to communicate, but the source of race conditions.
  • The scheduler time-slices threads; context switches and system calls cost real time (process switches also invalidate cached TLB translations), so minimize them and size thread pools to the workload, starting from the core count for CPU-bound work.
