Data Engineering — Moving & Shaping Data at Scale
So far in this guide you have learned where data lives (databases), how to run software across many machines (distributed systems), and how bytes travel between computers (networking). Data engineering sits on top of all three. It is the craft of building the "plumbing" that moves, cleans, combines, and delivers data so that dashboards, business analysts, and machine-learning models can use it.
Here is the key difference between a normal software engineer and a data engineer. A software engineer builds the app that creates data — every order placed, every button click, every sign-up. A data engineer builds the systems that collect, reshape, and deliver that data to the people and programs that make decisions. The product a data engineer ships is not a screen or a feature. It is a trustworthy, on-time, well-shaped dataset.
Two kinds of work: OLTP vs OLAP
The first thing to understand is that databases serve two very different workloads — that is, two different patterns of how data gets read and written.
- OLTP (Online Transaction Processing) — the live, operational database behind your app. A "transaction" here means one small unit of business work: "insert this new order", "fetch user 42". OLTP does many tiny reads and writes, each touching just a few rows, and it needs to be fast and handle lots of users at once. Examples: PostgreSQL, MySQL.
- OLAP (Online Analytical Processing) — the analytics database. It does a few enormous queries that scan millions of rows but only look at a handful of columns. Example query: "what was the average order value by region last quarter?" Examples of OLAP systems: Snowflake, BigQuery.
(A row is one record — one whole order. A column is one field across all records — every order's price.)
Row storage vs columnar storage — and why it matters
How data is physically laid out on disk decides which workload is fast. There are two layouts:
ROW-ORIENTED (good for OLTP): [id1,name1,price1][id2,name2,price2][id3,name3,price3] ^-- one whole record sits together --^ COLUMNAR (good for OLAP): ids: [id1, id2, id3 ] names: [name1, name2, name3] prices: [price1,price2,price3] ^-- each column sits together --^
Now imagine the query SUM(price) over a million orders. On the row layout, the computer must read every byte of every record — names, addresses, everything — just to pick out the price. On the columnar layout, it reads only the price block and ignores the rest.
Columnar storage wins for analytics for three reasons that stack on top of each other:
- Less I/O (less data read from disk). A query touching 3 of 50 columns reads only those 3 — often 90%+ less data.
- Better compression. "Compression" means storing data in fewer bytes. When values of the same type sit together (all countries in one block), patterns repeat and squeeze down tightly — typically 5–10× smaller, sometimes up to ~30× on columns with few distinct values (like "country") using tricks called dictionary and run-length encoding.
- Vectorized execution. Because same-type values sit in a tidy array, the CPU can apply one instruction to many values at once (a chip feature called SIMD). Aggregations that take minutes on a row store finish in milliseconds.
The trade-off: columnar is terrible at updating a single row, because one row is now scattered across many column blocks. That is exactly why OLTP databases stay row-oriented. You can't have it both ways in one system — so you copy data from the OLTP system into an OLAP system. That copying is the heart of data engineering.
Getting data across: ETL vs ELT
Both terms describe moving data from a source into an analytics system. They differ only in the order of two steps. E = Extract (pull data out), T = Transform (clean and reshape), L = Load (put it into the warehouse).
- ETL (Extract → Transform → Load): clean and reshape the data on a separate machine first, then load the finished result. This was standard when storage and compute were expensive.
- ELT (Extract → Load → Transform): dump the raw data into cheap storage or a powerful cloud warehouse first, then transform it inside that warehouse using SQL.
ELT is now dominant. Cloud warehouses have cheap storage and elastic compute (you rent power on demand), so loading raw and transforming later is fast and flexible. Crucially, you keep the raw data, so if you find a bug in a transform or need a new metric, you can re-derive it. A popular tool, dbt, lets analysts write transforms as version-controlled SQL with built-in tests.
Where the data lives: warehouse vs lake vs lakehouse
| Type | What it is | Strength | Weakness |
|---|---|---|---|
| Data warehouse | Structured, cleaned, modeled data. Schema-on-write (data must fit a defined shape before loading). Snowflake, BigQuery, Redshift. | Fast, reliable SQL analytics for business reports. | Historically rigid and pricey for messy/unstructured data. |
| Data lake | Cheap object storage (Amazon S3, Google Cloud Storage) holding any file format. Schema-on-read (shape is decided when you query, not when you store). | Cheap, flexible, stores images/JSON/logs/anything. | With no rules it becomes a "data swamp" — no transactions, no quality, unreliable to query. |
| Lakehouse | The lake's cheap storage plus warehouse-grade management (transactions, schema rules, time travel) added by a "table format". The modern default. | Cheap and flexible and reliable. | More moving parts to understand. |
File formats vs table formats — a point everyone confuses
This distinction trips up most beginners, so go slowly.
- Parquet and ORC are columnar file formats. A Parquet file is a single, unchangeable file that stores columns together and embeds small statistics (the min and max value in each chunk) so a query engine can skip files it doesn't need. Parquet is the de facto standard.
- Delta Lake, Apache Iceberg, and Hudi are table formats. They are not a new way to store the data — they store the data as Parquet files and add a metadata / transaction-log layer on top. That layer records "which files make up this table right now" and gives you ACID transactions (all-or-nothing safe writes), schema evolution (the table's shape can change safely), and time travel (query the table as it looked last Tuesday).
LAKEHOUSE LAYER CAKE (read bottom to top)
Query engines | Spark, Trino, Snowflake, DuckDB
----------------+---------------------------------
Table format | Iceberg / Delta: ACID, time
| travel, schema, "which files?"
----------------+---------------------------------
File format | Parquet (columnar files + stats)
----------------+---------------------------------
Object storage | S3 / GCS (cheap, holds the files)
Delta Lake is centered on Spark/Databricks. Iceberg is engine-agnostic — Spark, Trino, Flink, Snowflake, and DuckDB can all read the same Iceberg table, which is its defining advantage. In 2025 the two converged further (Databricks' Delta "UniForm" can expose a Delta table as Iceberg).
Batch vs stream — and "the log" that unifies them
- Batch processing: process a finite, bounded chunk of data on a schedule — "all of yesterday's orders, every night at 2am". High throughput, higher latency. Main tool: Apache Spark.
- Stream processing: process an endless flow of events as they arrive, within seconds or less. Tools: Apache Kafka (the pipe that carries events) plus a processor like Apache Flink.
There is a deep idea here, made famous by Martin Kleppmann in Designing Data-Intensive Applications: an append-only, ordered log unifies batch and streaming. A "log" here means a record book you only ever add to the end of, never edit. A batch is just a bounded slice of a stream. If every change is an immutable, ordered, replayable event in a durable log, then real-time consumers and historical reprocessing can run the exact same logic — you just choose where in the log to start reading.
Kafka basics — the log in practice
Kafka is a distributed, durable, append-only commit log. Unlike a traditional queue, messages are not deleted when read — they are kept so many readers can re-read them.
- Topic — a named stream of records, e.g. "orders".
- Partition — a topic is split into ordered shards. Order is guaranteed only within a partition, never across the whole topic. Partitions are the unit of parallelism. Records with the same key (e.g. same customer ID) always land in the same partition, so per-customer order is preserved.
- Offset — a record's sequential position within a partition. A consumer's offset is how far it has read. Kafka stores committed offsets in an internal topic called
__consumer_offsets. - Consumer group — a team of consumers splitting a topic's partitions. Each partition is read by exactly one consumer in the group. More partitions → more parallel consumers.
Topic "orders" Partition 0: [off0][off1][off2] -> Consumer A Partition 1: [off0][off1] -> Consumer B Partition 2: [off0] -> Consumer C (add a 4th consumer -> it sits IDLE, no partition left)
Delivery semantics. Kafka's default is at-least-once: a consumer processes a record, then saves (commits) its offset. If it crashes in between, that record is delivered again and processed twice. (At-most-once commits first and risks losing data; exactly-once is possible but harder and costlier.) The practical consequence is huge: because duplicates can happen, everything downstream must be idempotent.
Reliability disciplines: idempotency, replay, backfills, schema, quality
These disciplines separate a toy script from production data engineering.
Idempotency means: running the same input twice gives the same result — no duplicates, no double-counts. The simplest way to get it is to overwrite a target chunk instead of blindly adding to it.
DELETE FROM orders WHERE day='2026-06-20'; INSERT yesterday's orders; can be run 5 times and the table is identical every time. Compare a naive INSERT yesterday's orders with no delete — run it twice and every order is double-counted. The first is idempotent; the second is a time bomb.Replayability — because the raw data/log is retained, you can re-run a pipeline from a past point to fix a bug or rebuild a downstream table. Backfilling is exactly this: re-populating or correcting historical data (you added a column, fixed a transform, or onboarded a new source). A well-built pipeline backfills by re-running one isolated, idempotent period at a time (e.g. dbt's microbatch strategy keyed on an event_time column), so reprocessing a date range never double-counts.
Schema evolution — sources change shape over time. Additive changes (a new optional column) should be absorbed automatically. Breaking changes (a renamed, removed, or retyped column) should halt the pipeline and alert a human — never silently drop the field, or your dashboard quietly shows wrong numbers.
Data quality testing — validate at every stage: null-rate thresholds, uniqueness/primary-key checks, referential integrity (every order points to a real customer), value-range checks (price > 0), and row-count anomaly detection (today's row count shouldn't be 10× yesterday's). Tools: dbt tests and Great Expectations.
Orchestration: Airflow and DAGs
A real pipeline is many steps that must run in the right order, on a schedule, with retries when something fails. An orchestrator coordinates this. The classic tool is Apache Airflow, where you describe your pipeline as a DAG (Directed Acyclic Graph) in Python. "Directed" = arrows point one way (this step before that). "Acyclic" = no loops, so no step can depend on itself.
extract_orders
|
+--+--+
v v
validate dedupe
| |
+--+--+
v
load_warehouse -> build_marts -> refresh_dashboard
(arrows = dependencies; no cycles allowed)
Partitioning data files
In lakes and lakehouses, big tables are physically split into folders by a column — usually a date:
s3://bucket/orders/year=2026/month=06/day=20/part-000.parquet A query "WHERE day='2026-06-20'" reads ONLY that folder and skips every other day -> "partition pruning"
This is the file-level cousin of a database index. Best practice: partition on a low-cardinality column you actually filter on (a date is ideal — few distinct values, frequently filtered).
user_id or by the hour on tiny data. You end up with millions of tiny files (the "small-files problem"), which crushes query performance. Keep partitions reasonably large; usually partition by date.Why this underpins AI and machine learning
Models are only as good as the data feeding them. Data engineering is the supply chain that makes ML possible. Feature pipelines (often organized in a feature store) compute the inputs a model uses — for instance "average order value over the last 7 days". Training datasets are assembled by exactly the batch jobs described here, run over warehouse/lakehouse tables; replayability lets you reproduce the precise dataset a model was trained on.
- Data engineering builds the reliable pipelines that move, clean, and serve data; it sits on top of databases, distributed systems, and networking.
- OLTP is row-oriented for many small transactions; OLAP is columnar for big analytical scans (less I/O + better compression + vectorized execution). Don't run analytics on your production OLTP database — copy data out.
- ELT (load raw first, transform in the warehouse) beat ETL thanks to cheap cloud storage and tools like dbt. Parquet is a file format; Iceberg/Delta are table formats that wrap Parquet with ACID, time travel, and schema evolution — the lakehouse.
- An append-only, replayable log (Kafka) unifies batch and streaming. Kafka guarantees order only within a partition and delivers at-least-once, so downstream must be idempotent.
- Idempotency, replay, careful schema evolution, data-quality tests, and partition pruning are what make pipelines safe to retry and backfill. Airflow orchestrates the steps but Spark/Flink/dbt do the work.
- This whole layer is the supply chain for AI/ML — feature stores and shared log logic prevent training/serving skew and make datasets reproducible.