Foundations — How LLMs Work & Why These Skills Endure
Welcome to Durable AI/LLM Engineering. This first chapter assumes you know almost nothing about artificial intelligence. By the end of it you will understand what a Large Language Model actually does, the words engineers use to describe it, and, most importantly, why the skills this guide teaches will still be valuable years from now, even as the models themselves get replaced every few months. Read this chapter slowly. Everything later in the guide builds on the vocabulary and mental map you assemble here.
What a Large Language Model Is
A Large Language Model (LLM) is a computer program that has read an enormous amount of text (books, websites, code, conversations) and learned the statistical patterns of how words follow one another. "Large" refers to its size: modern LLMs contain billions to hundreds of billions of internal numbers, called parameters (the adjustable knobs the model tuned during training to capture those patterns). "Language Model" means its one core job is to model language: given some text, predict what text is likely to come next.
That is genuinely all an LLM does at its core. It does not "look things up" in a database, it does not "think" the way you do, and it has no live connection to the internet unless an engineer wires one up. It is a very sophisticated pattern-completion engine.
Analogy: An LLM is like a phone's autocomplete that went to university for a decade. Your phone suggests the next word after "Happy birthday to...". An LLM does the same thing, but it has absorbed so much text that it can complete "Write a polite email declining a meeting because..." or "Here is a Python function that sorts a list:" with fluent, useful results, one small piece of text at a time.
How it generates text: next-token prediction
LLMs do not work with whole words. They work with tokens: small chunks of text, each averaging roughly 3 to 4 characters, or about three-quarters of an English word. The word "running" might be one token; "antidisestablishmentarianism" might be five. The model first breaks your input into tokens, then computes a probability for every possible next token, picks one (usually the most likely, or close to it), appends it to the text, and repeats. This loop is called autoregressive generation ("auto" = self, "regressive" = feeding its own output back in).
Input text: "The capital of France is"
                |
                v
       [ break into tokens ]
                |
                v
  "The" "capital" "of" "France" "is"
                |
                v
+-------------------------------+
|    LLM predicts next token    |
|    Paris   -> 91%  likely     |
|    the     ->  3%  likely     |
|    a       ->  2%  likely     |
|    London  -> 0.4% likely     |
+-------------------------------+
                |
                v
append "Paris", then repeat the whole loop
("The capital of France is Paris" -> predict next...)
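The loop in the diagram can be sketched in a few lines of Python. This is a toy, not a real model: the `TOY_MODEL` table and its probabilities are invented for illustration, and it looks only at the last token, whereas a real LLM conditions on every token in its context. The shape of the loop (predict, append, repeat) is the part that carries over.

```python
# Toy next-token table: maps the last token to candidate next tokens and
# their probabilities. All numbers are invented for illustration only.
TOY_MODEL = {
    "is": {"Paris": 0.91, "the": 0.03, "a": 0.02, "London": 0.004},
    "Paris": {".": 0.95, ",": 0.05},
}


def predict_next(tokens):
    """Return the probability table for the next token.

    Simplification: a real LLM uses all tokens so far, not just the last one.
    """
    return TOY_MODEL.get(tokens[-1], {})


def generate(tokens, max_new_tokens=5):
    """Greedy autoregressive loop: append the most likely token, then repeat."""
    tokens = list(tokens)  # copy so the caller's list is not modified
    for _ in range(max_new_tokens):
        probs = predict_next(tokens)
        if not probs:  # the toy table has no continuation, so stop
            break
        next_token = max(probs, key=probs.get)
        tokens.append(next_token)  # feed the output back in as new input
    return tokens


prompt = ["The", "capital", "of", "France", "is"]
print(generate(prompt))
# prints ['The', 'capital', 'of', 'France', 'is', 'Paris', '.']
```

Each pass through the `for` loop is one full prediction step, which is why long answers cost more time and money than short ones.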
Notice that the model produces a list of probabilities, not a single answer. In the diagram it is 91% sure the next token is "Paris". A setting called temperature controls how the model picks from that list. Low temperature (e.g. 0) means "always grab the most likely token": output is focused, largely repeatable, and good for facts or code. High temperature (e.g. 0.9) means "sometimes pick a less likely token": output is more varied and creative, but more prone to going off the rails. Temperature is a creativity dial, not a correctness dial.
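One common way to implement temperature is to divide the model's raw scores (often called logits) by the temperature before converting them to probabilities. The sketch below assumes that approach; the `LOGITS` values are hypothetical, derived from the percentages in the diagram, and real serving stacks may add further sampling controls on top.

```python
import math
import random

# Hypothetical raw scores for the four candidate tokens in the diagram.
LOGITS = {
    "Paris": math.log(0.91),
    "the": math.log(0.03),
    "a": math.log(0.02),
    "London": math.log(0.004),
}


def apply_temperature(logits, temperature):
    """Convert scores to probabilities; lower temperature sharpens the spread."""
    scaled = [score / temperature for score in logits.values()]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return {token: e / total for token, e in zip(logits, exps)}


def sample_token(logits, temperature, rng=random):
    """Pick the next token. Temperature 0 is treated as 'always take the top one'."""
    if temperature == 0:
        return max(logits, key=logits.get)
    probs = apply_temperature(logits, temperature)
    return rng.choices(list(probs), weights=list(probs.values()), k=1)[0]


for t in (0.5, 1.0, 2.0):
    share = apply_temperature(LOGITS, t)["Paris"]
    print(f"temperature {t}: P(Paris) = {share:.3f}")
```

As the temperature rises, the share given to "Paris" shrinks and the unlikely tokens gain ground. Nothing in this code knows which answer is true, which is why temperature cannot act as a correctness dial.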
The context window
The context window is the maximum amount of text — measured in tokens — that the model can "see" at once. It includes your instructions, any documents you paste in, the ongoing conversation, and the answer it is producing. Think of it as the model's short-term working memory. Anything outside the window simply does not exist to the model.
Analogy: The context window is the model's desk, not its filing cabinet. Whatever papers fit on the desk, it can read instantly. Anything in a cabinet down the hall (last week's chat, a 500-page manual) is invisible until you fetch it and place it on the desk. The model has no long-term memory of its own.
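Because the desk has a fixed size, application code usually has to decide what stays on it. Here is a minimal sketch of one common policy: always keep the instructions, then keep as many of the most recent messages as fit. The function names are hypothetical, and the 4-characters-per-token estimate is a rough heuristic rather than a real tokenizer.

```python
def estimate_tokens(text):
    """Rough estimate: about 4 characters per token (heuristic, not a tokenizer)."""
    return max(1, len(text) // 4)


def fit_to_window(system_prompt, messages, max_tokens, reserve_for_answer=0):
    """Keep the system prompt plus as many of the newest messages as fit.

    reserve_for_answer sets aside room for the model's reply, which also
    counts against the window.
    """
    budget = max_tokens - reserve_for_answer - estimate_tokens(system_prompt)
    kept = []
    for message in reversed(messages):  # walk from newest to oldest
        cost = estimate_tokens(message)
        if cost > budget:
            break  # this message and everything older falls off the desk
        kept.append(message)
        budget -= cost
    return [system_prompt] + list(reversed(kept))  # restore chronological order


history = ["a" * 40, "b" * 40, "c" * 40]  # three messages, about 10 tokens each
window = fit_to_window("You are helpful.", history, max_tokens=25)
print(len(window))  # prints 3: the system prompt plus the two newest messages
```

Dropping the oldest messages is only one option. Summarizing them, or retrieving just the relevant ones, are alternatives that later chapters on context engineering cover.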
Windows have grown dramatically. As of early 2026, frontier models like Claude Opus 4.6 and Gemini 3.1 Pro offer roughly 1 million tokens of context (and some open-weight models advertise up to 10 million). But here is a crucial, frequently ignored fact: bigger windows still degrade before their advertised limit, so what you put in them matters.
What "inference" means
Training an LLM (teaching it the patterns) is a one-time, hugely expensive process done by a handful of labs. Inference is what happens every time you use the finished model: you send text in, and it runs the prediction loop to produce text out. When you call an LLM in an app, you are paying for inference, typically priced per token of input and per token of output. As an engineer, you almost never train models — you orchestrate inference. That distinction shapes this entire guide.
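Per-token pricing makes inference cost simple arithmetic. The sketch below assumes separate input and output prices quoted per million tokens; the dollar figures are hypothetical placeholders, not any provider's real rates.

```python
def inference_cost(input_tokens, output_tokens,
                   price_in_per_million, price_out_per_million):
    """Cost in dollars for one request, given per-million-token prices."""
    return (input_tokens * price_in_per_million
            + output_tokens * price_out_per_million) / 1_000_000


# Hypothetical prices: $3 per million input tokens, $15 per million output tokens.
cost = inference_cost(input_tokens=2000, output_tokens=500,
                      price_in_per_million=3.0, price_out_per_million=15.0)
print(f"${cost:.4f} per request")  # prints $0.0135 per request
```

A fraction of a cent per request looks negligible until it is multiplied by traffic, which is one reason engineers care about how many tokens they place in the context window.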
What "AI/LLM Engineering" Is — and the Durable Pillars
AI/LLM engineering is the discipline of building reliable software systems around these unreliable, probabilistic models. It is not data science (building models) and it is not casual chatbot use. It is the work of turning a model that is "right most of the time" into a product that is trustworthy enough to put in front of real users and real money.
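Here is the thesis of this entire guide, stated plainly: everyone can call an API; few can tell whether the output is good, and that gap is where durable value lives.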
Analogy: Anyone can pour water from a tap (call an API). The skilled trade is the plumbing — making sure clean water reaches the right place, that nothing leaks, and that you can tell when something is contaminated. Everyone can call an LLM API in three lines of code. Very few can tell whether the output is actually any good, or build a system that stays good.
That gap — between calling a model and knowing the result is correct, safe, and useful — is where durable value lives. To see why, separate two kinds of knowledge:
| "Prompt tricks" (rot quickly) | Durable engineering skills (endure) |
|---|---|
| Magic phrases like "you are an expert", "take a deep breath", "I'll tip you $100" | Knowing how to measure whether a change helped at all |
| Model-specific formatting quirks | Designing what information reaches the model and when |
| One-off wording that worked on GPT-4 last year | Structuring multi-step systems that recover from errors |
| Tied to a specific model version; breaks on the next release | Tied to fundamentals of reliable systems; survives model swaps |
Prompt tricks are folklore tuned to one model's temporary behavior. When the model is replaced — and it will be, every few months — the trick may stop working or even backfire. The four pillars below are different: they are about the system around the model, so they carry over no matter which model you plug in.
The four durable pillars
- Evaluation & measurement — building tests ("evals") that tell you, objectively, whether your AI feature is good and whether a change made it better or worse. This is the discipline that closes the "can you tell if the output is good?" gap. It is Pillar 1 because without measurement, every other improvement is a guess.
- Context engineering & retrieval — deciding precisely what goes into that limited context window: which instructions, which documents, which past messages. Retrieval (often called RAG, Retrieval-Augmented Generation) means fetching the few relevant facts from a large store and placing them on the model's "desk" at the right moment, instead of hoping the model already memorized them.
- Agent architecture & orchestration — going beyond a single question-and-answer to systems where the model can take multiple steps, use tools (call a calculator, search a database, hit an API), check its own work, and recover from failures. Orchestration is the conductor logic coordinating those steps.
- AI product judgment — the human wisdom to decide what to build, where the model is allowed to be wrong, when a person must stay in the loop, and how to design an honest experience around a tool that sometimes confidently makes things up (a failure mode called hallucination).
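To make Pillar 1 concrete, here is about the smallest eval that can exist: a fixed set of questions with known answers, and a score for each version of the system. Everything here is a stand-in. The two "models" are hypothetical canned functions, and exact-match scoring is only one of many possible metrics.

```python
def exact_match_score(model_fn, test_cases):
    """Fraction of cases where the model's answer matches the expected answer."""
    passed = 0
    for question, expected in test_cases:
        answer = model_fn(question)
        # Ignore case and surrounding whitespace when comparing.
        if answer.strip().lower() == expected.strip().lower():
            passed += 1
    return passed / len(test_cases)


# Hypothetical stand-ins for "the system before a change" and "after a change".
def baseline_model(question):
    canned = {"Capital of France?": "Paris", "2 + 2?": "5"}
    return canned.get(question, "I don't know")


def candidate_model(question):
    canned = {"Capital of France?": "paris ", "2 + 2?": "4"}
    return canned.get(question, "I don't know")


TEST_CASES = [
    ("Capital of France?", "Paris"),
    ("2 + 2?", "4"),
    ("Largest planet?", "Jupiter"),
]

print(f"baseline:  {exact_match_score(baseline_model, TEST_CASES):.2f}")   # 0.33
print(f"candidate: {exact_match_score(candidate_model, TEST_CASES):.2f}")  # 0.67
```

The value is in the comparison: the same test cases run against both versions turn "I think the change helped" into a number you can check again after the next model swap.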
How the Four Pillars Fit Together
The pillars are not four separate topics. They form one pipeline that turns a raw model call into a dependable product. Read the diagram from top to bottom as the path of a single user request, with two pillars wrapping the whole thing.
                ( PILLAR 4: AI PRODUCT JUDGMENT )
                decides what to build & where the
                   model is allowed to be wrong
 ________________________________________________________________
|                                                                |
|         User                                                   |
|        request                                                 |
|           |                                                    |
|           v                                                    |
|  +------------------+   PILLAR 2                               |
|  | Context assembly |<---- Retrieval (RAG): fetch the few      |
|  | (fill the window:|      relevant docs/facts from a store    |
|  |  instructions +  |                                          |
|  |  retrieved facts |                                          |
|  |  + history)      |                                          |
|  +------------------+                                          |
|           |                                                    |
|           v                                                    |
|  +------------------+   PILLAR 3                               |
|  | LLM inference    |<---> Agent loop: use tools, take steps,  |
|  | (next-token gen) |      check work, retry on failure        |
|  +------------------+                                          |
|           |                                                    |
|           v                                                    |
|     Answer / action                                            |
|           |                                                    |
|           v                                                    |
|  +------------------+   PILLAR 1                               |
|  | EVALUATION       |----> "Is this output actually good?"     |
|  | (evals & scores) |      feeds findings back to improve      |
|  +------------------+      every box above                     |
|________________________________________________________________|
         ^                                              |
         |______________ improvement loop ______________|
    (run evals -> find the weak box -> fix one thing -> re-run)
In words: Product judgment (4) decides the feature is worth building and sets the rules of the game. For each request, context engineering (2) fills the model's limited window with the right instructions and freshly retrieved facts. The model runs inference, often inside an agent loop (3) that lets it use tools and take multiple steps. Finally, evaluation (1) scores the result and feeds what it learns back into every earlier box — the disciplined "run evals, find the mismatch, change one thing, re-run" loop that is the engine of real AI engineering.
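The same path can be traced in code. The sketch below is a deliberately tiny stand-in for each box: retrieval is plain word overlap, `stub_model` is a hypothetical placeholder that simply echoes the first retrieved fact instead of calling a real LLM, and the agent loop of Pillar 3 is left out for brevity. Treat it as a map of where each pillar sits, not as a production design.

```python
import re

DOCS = [
    "Refunds are available within 30 days of purchase.",
    "Shipping takes 3 to 5 business days.",
    "Support is open Monday to Friday.",
]


def words(text):
    """Lowercase word set, ignoring punctuation."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve(question, docs, top_k=1):
    """Pillar 2: rank documents by how many words they share with the question."""
    ranked = sorted(docs, key=lambda d: len(words(question) & words(d)),
                    reverse=True)
    return ranked[:top_k]


def assemble_context(instructions, facts, question):
    """Pillar 2: fill the window with instructions, retrieved facts, question."""
    return "\n".join([instructions, "Facts:", *facts, "Question: " + question])


def stub_model(prompt):
    """Placeholder for LLM inference: returns the first line after 'Facts:'."""
    lines = prompt.split("\n")
    return lines[lines.index("Facts:") + 1]


def evaluate(answer, must_contain):
    """Pillar 1: a minimal check that the answer includes a required fact."""
    return must_contain.lower() in answer.lower()


question = "When are refunds available?"
facts = retrieve(question, DOCS)
prompt = assemble_context("Answer using only the facts provided.", facts, question)
answer = stub_model(prompt)
print(answer)
print("passed eval:", evaluate(answer, "30 days"))
```

If the eval fails, the structure tells you where to look: either the wrong fact was retrieved, the context was assembled badly, or the model mishandled good input.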
How to Use This Guide
This guide runs beginner to advanced. Each pillar gets its own deep chapter, in the order they appear above, because each builds on the last: you cannot improve what you cannot measure (evaluation first), you cannot reason about agents until you understand context, and product judgment ties everything together at the end.
- Read in order the first time. The vocabulary compounds — terms defined in early chapters are used freely later.
- Do the exercises. AI engineering is a hands-on craft. Reading about evals is not the same as writing one.
- Watch for the callout boxes. Best practice boxes tell you what to do; Common mistake boxes flag traps that have burned real teams; Key takeaway boxes (one per section) are your revision summary.
- Stay model-agnostic. Examples will mention specific models for concreteness, but the lessons are designed to outlive any one of them. When a model name appears, focus on the technique, not the brand.
- Keep a glossary. Jot down each bolded term as you meet it. By the end you will have a working vocabulary of the field.
Section summary
- An LLM is a next-token prediction engine: it splits text into tokens and predicts likely continuations one token at a time.
- The context window is the model's limited short-term memory; bigger windows still degrade before their advertised limit, so what you put in them matters.
- Temperature controls randomness/creativity, not correctness; inference is the act of running a finished model, which is what engineers orchestrate.
- AI/LLM engineering is building reliable systems around unreliable, probabilistic models.
- The core thesis: everyone can call an API; few can tell whether the output is good — that gap is where durable value lives.
- "Prompt tricks" rot with each model release; the four pillars endure because they describe the system around the model.
- The four durable pillars — Evaluation, Context engineering & retrieval, Agent architecture & orchestration, and Product judgment — form one feedback loop, not a list of separate topics.
- Use the guide in order, hands-on, model-agnostic, anchored to one real project of your own.