Databases II — How Databases Store & Find Data Fast (Indexes, B-Trees, LSM)
In the last section you learned what a database is. Now we open the hood and look at the most important question of all: when you ask a database for one row out of a billion, how does it find that row almost instantly? The answer is a small set of clever ideas — pages, indexes, B-trees, and logs — that together turn an impossible search into a few quick steps. Understanding them is the difference between an app that feels instant and one that grinds to a halt.
1. The physical reality: data lives in fixed-size pages
A database does not read one row at a time from disk. Storage is divided into fixed-size chunks called pages (also called blocks). A page is the smallest unit of input/output — the smallest amount the database ever reads from or writes to disk. To read one row, the engine reads the entire page that contains it into memory. To change one byte, it must eventually write the whole page back.
Real-world sizes: PostgreSQL uses 8 KB pages; MySQL/InnoDB uses 16 KB pages. (An OS filesystem block is usually 4 KB.) Why pages exist: disks and SSDs (solid-state drives — fast flash storage) are far quicker at moving one big chunk than thousands of tiny scattered reads. A page holds a small header, an array of pointers to rows, and the rows themselves.
2. Why a "full table scan" is slow
A full table scan (also called a sequential scan) reads every page of a table to find matching rows. Imagine a table of 10 million users and the query WHERE email = 'sam@x.com' with no index. The engine reads all hundreds of thousands of pages — even though exactly one row matches. The cost grows O(n): "order n", meaning the work grows in direct proportion to the table size. Double the table, double the work.
But a sequential scan is not always bad. If your query returns most of the table (say "give me all orders"), reading every page in order is actually the fastest way, because sequential reads are disk- and cache-friendly. The database planner deliberately chooses a scan for these low-selectivity queries. A scan is only the villain for selective queries — ones that return a tiny fraction of rows — where reading the whole table to find a handful of matches is enormous waste.
Seq Scan in a query plan is a bug to fix. For a query returning most of the table, the scan is genuinely cheaper than thousands of random index lookups, and the planner is right to pick it.3. Indexes: the core idea
An index is a separate, sorted data structure that lets the engine jump straight to matching rows instead of scanning everything. Each index entry stores the indexed column's value plus a pointer to where the full row lives. In PostgreSQL that pointer is a ctid (a "tuple ID" = page number + slot inside the page). The index is kept sorted by value, so the engine can binary-search it.
Indexes trade space and write cost for read speed — a deal that is almost always worth it for the columns you actually search on.
4. B-trees: how O(log n) actually happens
The default index in PostgreSQL, MySQL, SQL Server, Oracle, and SQLite is the B-tree — technically a B+tree. A B+tree is a tree of pages with these properties:
- Balanced — every leaf (bottom node) is the same distance from the top (the root). So every lookup costs the same number of steps, no matter which value you search.
- High fanout — each node is one page, and a page holds hundreds to ~1,000 keys. "Fanout" = how many children each node points to. High fanout means a very wide, very shallow tree.
- Sorted, linked leaves — all real keys live in the leaves, and the leaves are linked left-to-right in sorted order.
Here is the magic with concrete numbers. Suppose each node holds about 1,000 keys. Then a tree just 3 levels deep reaches 1,000 × 1,000 × 1,000 = 1 billion rows. A lookup walks root → internal node → leaf, doing a quick binary search inside each node. That is only 3 page reads to find one row among a billion — and the top one or two levels are almost always cached in memory, so often only the final leaf is a real disk read.
[ ROOT page ] level 1 (cached)
/ | \
[internal] [internal] [internal] level 2 (cached)
/ | \ / | \ / | \
[leaf][leaf] ... sorted (key -> row pointer) ... level 3
<--> <--> <--> leaves linked for range scans -->
~1000 x 1000 x 1000 = ~1 billion rows in 3 reads
That walk takes O(log n) time — "log n" grows incredibly slowly. Going from a million rows to a billion adds only one extra level. Compare that to the sequential scan's O(n): from "read every page" to "read 3–4 pages."
Because the leaves are sorted and linked, a B-tree also serves range queries — BETWEEN, >, <, ORDER BY, and prefix searches like LIKE 'abc%' — by finding the first matching leaf and walking sideways. This is something a hash index (which only does exact "equals" lookups) cannot do.
ORDER BY for free.5. Clustered vs secondary indexes (a big architectural fork)
There are two ways a database can relate the index to the actual rows, and the two most popular engines pick differently.
- Clustered index (InnoDB / MySQL): the table itself is stored physically sorted by a key — and that key is the primary key (the column that uniquely identifies each row). The leaf nodes are the rows. There is exactly one clustered index per table. Looking up by primary key lands you directly on the row — no second hop.
- Secondary index: a separate B-tree whose leaves hold the key plus a pointer back to the row. In InnoDB, that pointer is the primary-key value. So a search by a secondary column (say email) does two tree walks: first the email index gives you a primary key, then the clustered index turns that primary key into the actual row. This is the "double lookup."
- PostgreSQL is different: it stores the table as an unordered heap (a pile of rows in no particular order), and all indexes are secondary, pointing at a heap location (
ctid). There is no clustered index.
WHERE email = 'sam@x.com'. (1) Walk the email index → it returns user_id = 42. (2) Walk the clustered primary-key index for 42 → that lands on the real row. Two traversals. A covering index (next section) can remove the second one.6. Composite and covering indexes
A composite index covers multiple columns, e.g. on (last_name, first_name). It is sorted by last_name first, then by first_name within each last name — exactly like a phone book.
The leftmost-prefix rule: a composite index on (a, b, c) serves WHERE a=…, a=… AND b=…, and all three together — but not WHERE b=… alone. So column order matters. A solid rule of thumb: put equality columns first, then ORDER BY/sort columns, and put range columns (>, <, BETWEEN) last.
A covering index contains every column a query needs, so the engine answers the query from the index alone and never touches the table — PostgreSQL calls this an index-only scan. PostgreSQL and SQL Server support CREATE INDEX … (email) INCLUDE (name, created_at): the INCLUDEd columns are stored in the leaf for retrieval but are not sort keys. This eliminates the second hop (the heap or clustered lookup) and is dramatically faster.
INCLUDE. The query then never visits the table at all. (PostgreSQL caveat: an index-only scan still checks a small "visibility map" to confirm the row is current — a poorly-vacuumed table can quietly disable it.)7. The fundamental trade-off: reads vs writes
Indexes speed up reads, but nothing is free:
- They slow down writes. Every
INSERT,UPDATE, andDELETEmust update the table and every affected index. Five indexes mean five extra structures to maintain on every write. - They cost space. Each index is a copy of its columns plus pointers. Indexes together can exceed the size of the table itself.
- They need maintenance. Over time indexes fragment and bloat, and must be vacuumed or rebuilt (
REINDEX).
pg_stat_user_indexes to find unused ones).8. Write-optimized storage: LSM-trees vs B-trees
A B-tree does in-place updates: to change a row it finds the page and rewrites it — a random write (writing to a scattered location). Under very heavy write load, random writes become the bottleneck. The LSM-tree (Log-Structured Merge-tree) — used by Cassandra, RocksDB, LevelDB, ScyllaDB, and HBase — solves this by turning random writes into fast sequential writes (appending to the end of a file). Here is the write path:
- A write is appended to the WAL (write-ahead log, for durability — see next section) and inserted into an in-memory sorted structure called the memtable.
- When the memtable fills, it is frozen and flushed sequentially to disk as an immutable file: an SSTable (Sorted String Table) — key-value pairs sorted by key, plus a small index and a Bloom filter.
- SSTables pile up. A background compaction process merge-sorts overlapping SSTables into fewer, larger ones, throwing away overwritten values and deletions (a delete is recorded as a marker called a tombstone).
write -> WAL (durable) -> MEMTABLE (RAM, sorted)
| full -> flush (sequential)
v
SSTable SSTable SSTable <- L0 (immutable, sorted)
| compaction (merge-sort, drop dups + tombstones)
v
larger SSTables <- L1, L2 ... bigger, fewer
The catch is reads. A key might live in the memtable or in any SSTable. To avoid opening every file, each SSTable carries a Bloom filter — a tiny probabilistic structure that answers "definitely not in this file" or "maybe in this file." It lets the engine skip files that cannot contain the key (roughly 10 bits per key gives under ~1% false positives).
| Aspect | B-tree | LSM-tree |
|---|---|---|
| Writes | Random, in-place (slower under heavy load) | Sequential append (very high throughput) |
| Reads | Predictable, usually 3–4 page reads | May check several SSTables (read amplification) |
| Extra work | Page splits, fragmentation | Compaction rewrites data (write amplification), latency spikes |
| Best for | Read-heavy, general-purpose OLTP | Write-heavy, high-ingest, time-series |
9. The WAL: durability and crash recovery
The Write-Ahead Log (WAL) follows one rule: log the change before you change the data. Before a transaction commits, its changes are written and fsync'd (forced safely to disk) into the sequential WAL. Only then is the commit acknowledged to the client. The actual data pages can be updated in memory and written to disk lazily later, during a checkpoint (which flushes dirty pages and lets old WAL be recycled).
PostgreSQL calls this the WAL; InnoDB calls it the redo log — same idea. LSM-trees use a WAL too, to protect the in-memory memtable. Tuning knobs (synchronous_commit, innodb_flush_log_at_trx_commit) trade a little crash-safety for speed.
10. The buffer pool: caching pages in RAM
The buffer pool (PostgreSQL: shared_buffers; InnoDB: the buffer pool, often 50–75% of RAM) is an in-memory cache of recently used pages. Every read and write goes through it. When a page is requested, the engine checks the buffer pool first — a buffer hit means no disk I/O at all. Modified ("dirty") pages are written to disk later at checkpoints.
This is why the second run of a query is often dramatically faster, and why the top levels of a B-tree are effectively free — they stay resident in RAM. Keeping your "working set" (the data you actually touch) in the buffer pool is one of the highest-leverage database optimizations.
11. Query planning and fixing a slow query
SQL is declarative: you state what you want, and the query planner (optimizer) decides how — which index, which scan type, which join order. It uses table statistics (row counts and value distributions, gathered by the ANALYZE command) to estimate selectivity (what fraction of rows match) and picks the cheapest plan.
EXPLAINshows the estimated plan without running the query.EXPLAIN ANALYZEactually runs it and shows real timings and row counts. AddBUFFERSto see pages hit in cache vs read from disk.
Before:
Seq Scan on orders (rows=1) actual time=210ms — reading the whole table to find one order.Fix:
CREATE INDEX idx_orders_customer ON orders(customer_id);After:
Index Scan using idx_orders_customer ... actual time=0.3ms, and BUFFERS shows shared reads dropping from thousands to a handful. A ~700× speedup from one index.Reading a plan: look for a Seq Scan on a big table feeding a selective filter (smells like a missing index), Index Scan/Index Only Scan (good), and a big gap between estimated and actual rows (a sign of stale statistics — run ANALYZE). A Nested Loop over many rows often means a missing index on the join column.
pg_stat_statements or the slow-query log); (2) run EXPLAIN ANALYZE; (3) spot the Seq Scan or expensive filter; (4) add an index on the WHERE/JOIN/ORDER BY columns (equality columns first, range last); (5) consider INCLUDE columns to make it covering; (6) re-run EXPLAIN ANALYZE and confirm the planner actually uses the new index and reads fewer buffers.WHERE lower(email)=…) or casting its type (WHERE date_col::text=…) — match the column exactly or build an expression index. A leading wildcard (LIKE '%abc') can't use a normal B-tree (only 'abc%' can) — use a trigram/full-text index instead. And creating an index does not guarantee it's used: stale stats or low selectivity may make the planner ignore it, so always re-check the plan.- Databases read and write fixed-size pages (Postgres 8 KB, InnoDB 16 KB); cost is counted in pages read, not rows.
- A B-tree index turns an O(n) full scan into an O(log n) lookup — ~3–4 page reads even across a billion rows — and its sorted, linked leaves power range scans and
ORDER BY. - InnoDB stores rows inside the clustered primary-key index (secondary lookups cost two traversals); PostgreSQL keeps an unordered heap with all-secondary indexes.
- Composite indexes obey the leftmost-prefix rule (equality columns first, range last); a covering index answers a query without touching the table.
- Indexes speed reads but tax every write and consume space — index for real query patterns, not every column.
- LSM-trees win on write-heavy workloads (sequential writes + Bloom filters) at the cost of read/write amplification; B-trees give predictable reads for general OLTP. The WAL guarantees durability and crash recovery, and the buffer pool caches hot pages in RAM.
- Diagnose with
EXPLAIN ANALYZE (BUFFERS): find the scan, add the right index, and verify the planner actually chose it.