Databases I — Relational Databases, SQL & ACID

By Pritesh Yadav 18 min read

Almost every serious application needs to store data — orders, users, payments, messages — and get it back later, correctly, even when many people use the app at the same time. The tool that does this job is a database. In this section we build up from "what is a database?" to the deep guarantees (called ACID) that let a bank move your money without ever losing a cent. Take it slowly; each idea builds on the last.

1. What is a database, and why not just use files?

A database is a structured system for storing, searching, and safely changing data — especially when many users touch it at once. A DBMS (Database Management System) is the software that runs the database. Popular ones are PostgreSQL, MySQL (and its cousin MariaDB), SQLite, SQL Server, and Oracle.

You could, in theory, store your orders in plain files — a CSV (comma-separated values) file, or a folder of JSON files. People try this and it falls apart quickly. Here is why a real database wins on five fronts:

  • Concurrency control — "concurrency" means many things happening at the same time. If two parts of your program write to the same file at once, they can overwrite each other and corrupt the file. A database safely interleaves thousands of simultaneous writers.
  • Querying — a "query" is a question you ask of the data. With files, "all orders over $100 placed last week by customers in Texas" means writing a loop by hand. A database answers it in one line of SQL.
  • Integrity — "integrity" means the data stays valid and consistent. A database can refuse to save an order for a customer that doesn't exist, or two users with the same ID. Files happily save garbage.
  • Crash safety — if the power dies halfway through writing a file, you get a half-written, broken file. A database guarantees a write either fully happened or didn't happen at all.
  • Indexes — an "index" is a lookup structure (like a book's index) that finds one row fast. Without it, finding one row among 10 million means reading all 10 million. We cover indexes deeply in a later section.
Analogy: A flat file is a single shared notebook on a desk — fine for one person, chaos when ten people grab it at once and scribble. A DBMS is a well-run library: a librarian (the DBMS) checks books in and out, keeps a catalog so you find any book instantly, and never lets two people write on the same page at the same time.

2. The relational model: tables, rows, columns, keys

A relational database organizes data into tables. Think of a table as a spreadsheet:

  • A column (also called a field) is a vertical category, like email or price. Each column has a fixed data type — the kind of value it holds (a number, text, a date).
  • A row (also called a record) is one horizontal entry — one customer, one order.

Two special kinds of columns hold the model together:

  • Primary key (PK) — a column (or group of columns) that uniquely identifies each row. It must be unique (no two rows share it) and not null ("null" means "no value" — a PK can never be empty). Most tables use a surrogate key: a meaningless-but-stable ID, like an auto-incrementing number (1, 2, 3…) or a UUID (a long random unique string).
  • Foreign key (FK) — a column whose value must match a primary key in another table. This enforces referential integrity: the database won't let you save an order whose customer_id points at a customer that doesn't exist. You also control what happens when the referenced row is deleted — ON DELETE RESTRICT blocks the delete, ON DELETE CASCADE deletes the children too.
Common mistake: Using a real-world, changeable value (like an email address) as the primary key. Emails change. When they do, every row pointing at that key breaks. Use a stable surrogate key instead and let email be a normal UNIQUE column.

Relationships: how tables connect

Cardinality means "how many on each side."

TypeMeaningHow to build it
1:1One row maps to exactly one row elsewhere (a user and their user_profile).A foreign key that is also marked UNIQUE.
1:manyThe common case: one customer has many orders.Put the FK on the many side: orders.customer_id.
many:manyAn order has many products, and a product appears on many orders.You need a join table (next paragraph).

A single foreign key cannot represent many:many. You need a third table — a join table (also called a junction or bridge table). For orders and products, that's order_items(order_id, product_id, quantity, unit_price), holding a FK to each side. Crucially, the join table also stores facts about the relationship itself — here, the quantity and the price paid.

 customers          orders            order_items        products
 +--------+        +----------+      +-------------+    +---------+
 | id PK  |---<    | id PK    |--<   | order_id FK |    | id PK   |
 | name   |   \    | cust. FK |   \  | product_id  |>---| name    |
 | city   |    \-->| total    |    \>| quantity    |    | price   |
 +--------+        +----------+      | unit_price  |    | stock   |
                        ^            +-------------+    +---------+
                        |
                   +----------+
                   | payments |  (many payments : 1 order)
                   +----------+
   "---<" means "one-to-many" (crow's foot points at the 'many' side)

3. SQL basics: asking and changing

SQL = Structured Query Language, the language for talking to a relational database. The four core DML (Data Manipulation Language) statements:

  • SELECT name, total FROM orders WHERE total > 100 ORDER BY total LIMIT 10;read data.
  • INSERT INTO orders (customer_id, total) VALUES (5, 99.50);create a row.
  • UPDATE orders SET total = 120 WHERE id = 10;change rows.
  • DELETE FROM orders WHERE id = 10;remove rows.
Common mistake (career-ending edition): Running UPDATE or DELETE without a WHERE clause. DELETE FROM orders; deletes every order in the table. UPDATE accounts SET balance = 0; zeroes out everyone. The WHERE says which rows; forget it and the change hits all of them. Always write the WHERE first.

JOINs: combining tables

A JOIN stitches rows from two tables together using a matching condition (the ON clause).

Join typeWhat it returns
INNER JOINOnly rows that have a match on both sides.
LEFT JOINAll rows from the left table, plus matches from the right (NULLs where there's no match).
RIGHT JOINMirror image: all rows from the right table.
FULL OUTER JOINAll rows from both sides, matched where possible.

Basic inner join — customers and their orders:

SELECT c.name, o.total FROM customers c JOIN orders o ON o.customer_id = c.id;

The classic LEFT JOIN use case — find customers who have never ordered. Keep all customers, attach orders where they exist, then keep only the rows where no order matched (the order columns came back NULL):

SELECT c.name FROM customers c LEFT JOIN orders o ON o.customer_id = c.id WHERE o.id IS NULL;

Common mistake: Using INNER JOIN when you needed LEFT JOIN. An inner join drops unmatched rows — so customers with zero orders silently vanish from your report, and you never notice the gap.

Aggregation: collapsing many rows into a summary

Aggregate functions reduce many rows to one number: COUNT (how many), SUM (total), AVG (average), MIN, MAX. GROUP BY produces one output row per group. Revenue per customer, keeping only big spenders:

SELECT customer_id, SUM(total) AS revenue FROM orders GROUP BY customer_id HAVING SUM(total) > 1000;

Notice two filters that look similar but run at different times:

  • WHERE filters individual rows before grouping. It cannot mention SUM or COUNT because the groups don't exist yet.
  • HAVING filters groups after grouping. This is where you say "only groups whose total exceeds 1000."
Key takeaway: WHERE runs first on raw rows; HAVING runs last on grouped results. To exclude refunded orders and keep only big spenders, you use both: WHERE status <> 'refunded' ... GROUP BY ... HAVING SUM(total) > 1000.

4. Schema and data types

The schema is the blueprint of your database — the tables, their columns, the types, and the rules — defined with CREATE TABLE. Picking the right data type matters for both correctness and storage:

  • INTEGER / BIGINT — whole numbers.
  • NUMERIC(p,s) / DECIMAL(p,s) — exact decimal numbers. Use this for money. (p = total digits, s = digits after the decimal point.)
  • VARCHAR(n) / TEXT — text.
  • BOOLEAN — true/false.
  • DATE, TIMESTAMP, TIMESTAMPTZ — dates and times; prefer the timezone-aware TIMESTAMPTZ.
  • UUID — a long random unique ID.
  • JSON / JSONB (PostgreSQL) — for flexible, semi-structured data.
Common mistake — money in FLOAT: A FLOAT/DOUBLE stores numbers in binary, and it cannot represent values like 0.10 exactly. In floating point, 0.1 + 0.2 comes out as 0.30000000000000004. Over thousands of orders these tiny errors accumulate into real, visible discrepancies in totals and payouts. Store money in NUMERIC(10,2), which holds 0.10 + 0.20 = 0.30 exactly.

Constraints are rules the database enforces so bad data can never be saved:

  • NOT NULL — value is required.
  • UNIQUE — no duplicates (e.g. one account per email).
  • CHECK (total >= 0) — a custom condition that must hold (e.g. stock can't go negative).
  • DEFAULT — a value used when none is given (e.g. status DEFAULT 'pending').
Best practice: Push business rules into constraints, not just app code. A CHECK (stock >= 0) guarantees you can never oversell even if a future bug forgets to check. The database is your last line of defense.

5. Normalization: removing harmful duplication

Normalization is the process of organizing columns so data isn't pointlessly duplicated. Duplication causes update anomalies — you change a customer's city in one row but forget the other ten, and now the data disagrees with itself. There are three main "normal forms," each building on the last. We'll evolve one messy table step by step.

Start with this badly designed table:
orders(id, customer_id, customer_city, product_id, product_name, quantity)

1NF — atomic values, no repeating groups

"Atomic" means each cell holds one value. No comma-lists like "red,blue,green" in a single column, and no repeating columns like phone1, phone2, phone3. If our order tried to cram several products into one row as a list, 1NF forces us to give each product its own row (or its own line-item record).

Common mistake: Stuffing "red,blue" into one cell. Now you can't easily filter "orders containing blue," can't index it, and can't count colors. It quietly destroys your ability to query.

2NF — no partial dependency on part of a composite key

This only matters when the primary key is composite (made of more than one column). Every non-key column must depend on the whole key, not just part of it. In a line-items table keyed by (order_id, product_id), the column product_name depends only on product_id — half the key. That's a partial dependency. Fix: move product_name out to a products table.

3NF — no transitive dependency

A transitive dependency is when a non-key column depends on another non-key column instead of the key. In orders(id, customer_id, customer_city), customer_city really depends on customer_id, not on the order. Fix: move customer_city to the customers table, where it belongs.

Key takeaway — the mnemonic: Every non-key column must depend on "the key, the whole key, and nothing but the key, so help me Codd." "The key" = 1NF (there is a key), "the whole key" = 2NF (no partial dependency), "nothing but the key" = 3NF (no transitive dependency).

Denormalization: breaking the rules on purpose

Denormalization means deliberately adding redundancy back to make reads faster — for example storing a precomputed order_total instead of summing line items on every page load. The trade-off: you must keep the copies in sync on every write. Choose it deliberately, only where read speed truly matters.

Key takeaway — snapshotting is NOT a violation: When an order copies the product's name and price at the moment of purchase, that is correct, not a normalization bug. The snapshot is a genuinely different fact ("price paid on June 1") than the live catalog price ("price today"). If you didn't snapshot, editing a product's price tomorrow would silently rewrite every past invoice — a serious accounting error.
Common mistake: Over-normalizing a heavy reporting query into 12 joins (slow), or over-denormalizing and forgetting to keep the duplicate copies in sync (wrong data). Both are failure modes. Normalize by default; denormalize only with a clear reason.

6. Transactions and ACID

A transaction is a group of SQL statements treated as one indivisible unit. You wrap them: BEGIN; … your statements … COMMIT; to make them permanent, or ROLLBACK; to undo everything. ACID is the set of four guarantees a good database gives transactions:

  • A — Atomicity (all-or-nothing): every statement commits, or none do. A crash or rollback midway leaves the database exactly as if the transaction never started.
  • C — Consistency: the transaction moves the database from one valid state to another, never violating constraints (PK, FK, CHECK, NOT NULL). If any statement would break a rule, the whole transaction is rejected.
  • I — Isolation: running transactions don't see each other's half-finished, uncommitted work. The result looks as if they ran one after another (how strictly depends on the "isolation level," next).
  • D — Durability: once COMMIT returns "done," the change survives a crash or power loss. This is achieved with a write-ahead log (WAL) — the change is written to a log on disk before the commit is acknowledged, so it can always be recovered.
Common mistake: Confusing the ACID "C" with the CAP "C." ACID consistency = "constraints stay valid." CAP consistency (a later networking topic) = "all replicas agree on the latest value." They are unrelated concepts that happen to share a letter.

The bank-transfer example — all four letters at once

Transfer $100 from Alice to Bob:

BEGIN;
  UPDATE accounts SET balance = balance - 100 WHERE id = 'alice';
  UPDATE accounts SET balance = balance + 100 WHERE id = 'bob';
COMMIT;
  • Atomicity — if the server dies between the two UPDATEs, atomicity rolls back the debit. Money is never taken from Alice without reaching Bob; no $100 vanishes.
  • Consistency — a CHECK (balance >= 0) rejects the whole transfer if Alice doesn't have $100.
  • Isolation — a "total balance" report running concurrently won't catch the moment $100 has left Alice but not yet arrived at Bob; it sees the system before or after, never in-between.
  • Durability — once the bank says "transfer complete," a power failure one second later cannot lose it.
Common mistake: Doing multi-step writes (debit + credit, or order + stock decrement) as separate statements outside a transaction. A failure between them leaves a broken half-state — money debited but not credited, an order with no stock taken. Wrap related writes in one transaction.

7. Isolation levels and concurrency anomalies

Perfect isolation (every transaction acting as if it's completely alone) is expensive — it requires lots of locking, which slows everyone down. So SQL defines weaker levels that trade some isolation for speed. To choose a level you must understand the three classic read anomalies — bad things that can happen when transactions overlap:

  • Dirty read — Transaction 1 reads a row that Transaction 2 has changed but not yet committed. If T2 then rolls back, T1 acted on data that never officially existed.
  • Non-repeatable read — T1 reads a row, T2 updates and commits that same row, T1 reads it again and gets a different value. The same row changed underneath T1.
  • Phantom read — T1 runs a range query (WHERE total > 100), T2 inserts a new matching row and commits, T1 re-runs the query and now gets extra rows. The set of rows changed — new "phantoms" appeared.
 Non-repeatable read timeline
 T1                          T2
 ---------------------       ---------------------
 read row #7  -> $50
                             UPDATE row #7 = $80
                             COMMIT
 read row #7  -> $80   <- same row, different value!

The four standard isolation levels, from weakest to strongest, and which anomalies each prevents:

Isolation levelDirty readNon-repeatable readPhantom read
READ UNCOMMITTEDpossiblepossiblepossible
READ COMMITTEDpreventedpossiblepossible
REPEATABLE READpreventedpreventedpossible*
SERIALIZABLEpreventedpreventedprevented

* The bare SQL standard allows phantoms at REPEATABLE READ — but real engines below are stricter.

How real engines actually behave (frequently tested)

  • PostgreSQL default = READ COMMITTED. Postgres has no real READ UNCOMMITTED — ask for it and you get READ COMMITTED. Its REPEATABLE READ uses snapshot isolation (each transaction sees a frozen snapshot of the data) and, unlike the bare standard, also prevents phantom reads. Its SERIALIZABLE uses SSI (Serializable Snapshot Isolation): it watches for dangerous read/write dependency cycles and aborts one transaction — so your app must be ready to retry on a serialization failure.
  • MySQL/InnoDB default = REPEATABLE READ, and InnoDB's "next-key locking" largely blocks phantoms too.
  • The same level name can behave differently across engines — the SQL standard sets minimums, not exact behavior. Never assume two databases match just because the level is called the same thing.

The lost update anomaly

There's one more important hazard the standard table doesn't list: the lost update. Two transactions both read the same value, both modify it, and both write it back — the second write silently overwrites the first.

 Both shoppers read stock = 1, both sell the last unit:
 T1: read stock=1 -> set stock = 1-1 = 0  (sold!)
 T2: read stock=1 -> set stock = 1-1 = 0  (sold AGAIN!)
 Result: stock=0 but TWO units sold -> oversold.
Common mistake: Believing that simply "using a transaction" or even SERIALIZABLE at READ COMMITTED magically prevents lost updates. At READ COMMITTED it does not. You must defend explicitly with one of:
  • SELECT … FOR UPDATE — locks the row so the second transaction waits until the first commits, then reads the new value.
  • Atomic incrementUPDATE products SET stock = stock - 1 WHERE id = 7 AND stock > 0; does the read-and-write in one indivisible step.
  • Version column (optimistic concurrency) — add a version number; update only if the version hasn't changed, and retry if it has.
Best practice: Most online apps run at READ COMMITTED for throughput and reach for SELECT … FOR UPDATE or atomic updates exactly where a race matters (stock, balances, counters). Reserve SERIALIZABLE for the trickiest invariants — and always code the retry loop it needs.

8. Putting it together: orders and payments

An e-commerce checkout uses nearly everything above. The schema: customers (1:many) orders (1:many) order_items (many:1) products, plus payments (many:1 orders).

Placing an order must be ONE transaction that does all of this together:

  1. Insert the order row.
  2. Insert each order_item, snapshotting the product's name and price at that moment (so a future price edit can't rewrite this invoice).
  3. Decrement products.stock, protected by CHECK (stock >= 0) so you can never oversell.
  4. Record the payment.

Because it's one transaction, a failure anywhere (card declined, out of stock) rolls the whole thing back — no half-created orders, no stock wrongly taken. To stop two shoppers buying the last unit at the same instant, lock the stock row with SELECT … FOR UPDATE (or use the atomic WHERE stock > 0 update). And every money column is NUMERIC, never float.

Example — the last unit, done right:
BEGIN;
  -- atomic decrement: succeeds only if stock remains
  UPDATE products SET stock = stock - 1
    WHERE id = 7 AND stock > 0;
  -- if 0 rows were affected -> sold out -> ROLLBACK
  INSERT INTO orders (customer_id, total) VALUES (42, 19.99);
  INSERT INTO order_items (order_id, product_id, quantity,
                           unit_price, product_name)
    VALUES (currval, 7, 1, 19.99, 'Blue Mug');  -- snapshot!
  INSERT INTO payments (order_id, amount) VALUES (currval, 19.99);
COMMIT;
The atomic decrement plus the CHECK constraint make overselling impossible; the snapshot freezes the price; the single transaction guarantees no half-order survives a failure.
Key takeaways:
  • A DBMS gives you concurrency control, querying, integrity, crash safety, and indexes that flat files can't — use one for any real data.
  • The relational model is tables + rows + typed columns, tied together by primary keys and foreign keys; many:many always needs a join table.
  • Always write the WHERE on UPDATE/DELETE; WHERE filters rows before grouping, HAVING filters groups after; use LEFT JOIN to keep unmatched rows.
  • Store money in NUMERIC, never FLOAT; normalize to 3NF to kill harmful duplication, but snapshot order prices on purpose — that's a real fact, not a violation.
  • A transaction (BEGIN…COMMIT) gives ACID: atomicity, consistency, isolation, durability (via the WAL). ACID's "C" is about constraints, not CAP's replica consistency.
  • Weaker isolation levels allow dirty/non-repeatable/phantom reads; Postgres defaults to READ COMMITTED, MySQL/InnoDB to REPEATABLE READ; lost updates need FOR UPDATE, atomic increments, or version columns — a transaction alone won't save you.

Continue reading