Revision Cheat Sheet

By Pritesh Yadav · June 21, 2026 · 8 min read

This is the 10-minute review. Each pillar distilled to its load-bearing ideas; the decision table tells you which technique fits which situation; the golden rules are what to remember when you forget everything else.

Evals in one minute

Model eval vs product eval — model eval asks "is the model good?" (generic benchmarks); product eval asks "does my feature do its job for my users?" The second is the one that matters.
Golden dataset — a curated, labelled set of representative inputs + expected behaviour; your ground truth. Build it from real usage and edge cases, not invented examples.
Offline vs online — offline = measured pre-ship against fixed data; online = measured in production on live traffic (A/B, user signals). You need both: offline gates releases, online catches reality.
Error analysis — read actual failing outputs, cluster them into failure modes, name them. This comes before metrics — you can't measure what you haven't looked at.
LLM-as-a-judge — use a model to score outputs at scale. Cheap and fast, but must be validated against human labels before you trust it.
Judge biases — position (favours first/last option), verbosity (favours longer answers), self-preference (favours its own model family). Mitigate: randomise order, control for length, use a different judge model.
Binary vs Likert — binary (pass/fail) is reliable and actionable; Likert (1–5) is noisier and graders disagree. Prefer binary unless you genuinely need gradation.
RAG-specific metrics — faithfulness (answer supported by retrieved context, no hallucination), context precision (retrieved chunks are relevant), context recall (all needed info was retrieved). Diagnose retrieval vs generation separately.
pass@k vs pass^k — pass@k = succeeds at least once in k tries (measures capability/best-case); pass^k = succeeds in all k tries (measures reliability/consistency). Agents need pass^k thinking.
The eval flywheel — ship → observe failures → add them to the golden set → improve → re-measure. The dataset compounds into a moat.
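
The gap between pass@k and pass^k is easiest to see with numbers. The sketch below assumes each attempt succeeds independently with the same probability p, which is a simplification (real attempts are often correlated); the function names are illustrative and not taken from any library.

```python
def pass_at_k(p: float, k: int) -> float:
    """Chance that at least one of k independent attempts succeeds."""
    return 1.0 - (1.0 - p) ** k


def pass_hat_k(p: float, k: int) -> float:
    """Chance that all k independent attempts succeed (pass^k)."""
    return p ** k


if __name__ == "__main__":
    p = 0.8  # assumed per-attempt success rate, for illustration only
    for k in (1, 5, 10):
        print(f"k={k:2d}  pass@k={pass_at_k(p, k):.3f}  pass^k={pass_hat_k(p, k):.3f}")
```

Under these assumptions, an agent that succeeds 80% of the time per attempt has a pass@5 of about 0.9997 but a pass^5 of only about 0.328. The same system looks nearly perfect on one metric and unreliable on the other.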

Retrieval in one minute

Tokens / context window — models read/write in tokens (sub-word units); the context window is the finite token budget for prompt + output. Everything you retrieve competes for that space.
Chunking — splitting documents into retrievable pieces. Too big = noisy/wasteful; too small = lost meaning. Chunk on semantic boundaries, keep some overlap.
Embeddings — text mapped to vectors so semantic similarity = geometric closeness. The basis of meaning-based search.
Vector DB / similarity search — stores embeddings and finds nearest neighbours to a query vector. Fast approximate semantic lookup.
Hybrid search — combine semantic (vector) with keyword/lexical (BM25). Catches both meaning and exact terms (names, codes, IDs) that embeddings miss.
Reranking — a second, more precise model reorders the top candidates from first-stage retrieval. Cheap retrieval casts a wide net; reranker tightens precision.
RAG pipeline — retrieve relevant context → inject into prompt → generate a grounded answer. Grounds output in your data without retraining.
Query rewriting — reshape the raw user query (expand, clarify, decompose) before retrieval so it matches how documents are written.
Contextual retrieval — prepend chunk-level context (what document/section it came from) before embedding, so isolated chunks stay interpretable.
GraphRAG — retrieve over a knowledge graph of entities/relationships, not just flat chunks. Better for multi-hop, "how do these connect?" questions.
RAG vs fine-tuning vs long-context — RAG adds knowledge (facts, updatable, citable); fine-tuning teaches behaviour/format/style; long-context stuffs everything into the window (simple, but costly, and answer quality degrades as the context grows). Default to RAG for facts.
"Lost in the middle" — models attend best to the start and end of a long context and miss the middle. Put the most important retrieved material at the edges; don't assume more context = better.
Agent memory — persisting state across turns/sessions (scratchpad, summaries, retrieved history) so an agent isn't amnesiac. Memory is retrieval applied to the agent's own past.
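
Hybrid search needs some way to merge the vector ranking with the keyword ranking. The notes above do not name a fusion method, so the sketch below uses reciprocal rank fusion (RRF) as one common option. The document IDs are made up, and the constant k=60 is a conventional default rather than a tuned value.

```python
from collections import defaultdict
from typing import Dict, List, Sequence


def reciprocal_rank_fusion(rankings: Sequence[Sequence[str]], k: int = 60) -> List[str]:
    """Merge several ranked lists of document IDs into one ranking.

    Each document scores 1 / (k + rank) for every list it appears in,
    so documents ranked highly by more than one retriever rise to the top.
    """
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Sort by descending score; break ties by ID so the output is deterministic.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


if __name__ == "__main__":
    vector_hits = ["doc_a", "doc_b", "doc_c"]   # hypothetical semantic results
    keyword_hits = ["doc_c", "doc_a", "doc_d"]  # hypothetical BM25 results
    print(reciprocal_rank_fusion([vector_hits, keyword_hits]))
```

RRF uses only rank positions, so it sidesteps the problem that vector similarity scores and BM25 scores sit on different scales. A reranker can then reorder the top of the fused list.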

Agents in one minute

Agent vs workflow — a workflow is a fixed, predefined path of LLM calls; an agent dynamically decides its own steps and tool use. Workflows are predictable; agents are flexible but harder to control. Prefer a workflow when the path is known.
The agent loop — observe → think → act (call a tool) → observe result → repeat until done. The engine under every agent.
Tools / function calling — the model emits a structured request to call a function; your code runs it and feeds the result back. How agents touch the real world.
ReAct — interleave Reasoning and Acting ("thought → action → observation" cycles) so the model plans, acts, and adjusts on feedback.
Workflow patterns — prompt chaining (sequential steps), routing (classify then dispatch), parallelization (fan out, aggregate), orchestrator-worker (a lead delegates subtasks), evaluator-optimizer (generate → critique → revise loop), reflection (model reviews its own output before finalizing).
Single vs multi-agent — multi-agent adds coordination cost and new failure surface. Use one agent until a task genuinely needs separate roles/contexts; most don't.
MCP (Model Context Protocol) — an open standard for connecting models to tools/data sources, so integrations are reusable instead of bespoke per app.
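Failure modes — loops (agent repeats the same action forever), compounding errors (a small early mistake propagates and amplifies across steps). Long autonomous chains compound risk because per-step success rates multiply: assuming independent steps, a step that succeeds 95% of the time, repeated 20 times, completes end to end only about 36% of the time.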
Reliability practices — cap iterations, validate every tool output, constrain the action space, add checkpoints/human approval for risky steps, log traces, and design for graceful failure. Build for pass^k, not pass@k.
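
The agent loop and two of the reliability practices above (an iteration cap and a guard against repeated actions) can be sketched without a real model. Everything named here is hypothetical: `decide` stands in for the LLM call, `TOOLS` is a toy tool registry, and the action format is an assumption made for illustration.

```python
from typing import Callable, Dict, List, Tuple

Action = Dict[str, str]    # {"tool": tool name, "input": argument}
Step = Tuple[Action, str]  # (action taken, observation returned)


def run_agent(
    decide: Callable[[str, List[Step]], Action],
    tools: Dict[str, Callable[[str], str]],
    task: str,
    max_steps: int = 5,
) -> Dict[str, object]:
    """Loop decide -> act -> observe until the policy finishes or a guard trips."""
    history: List[Step] = []
    seen = set()
    for _ in range(max_steps):  # iteration cap
        action = decide(task, history)
        if action["tool"] == "finish":
            return {"status": "done", "answer": action["input"], "steps": len(history)}
        key = (action["tool"], action["input"])
        if key in seen:  # same call twice: treat it as a loop and stop
            return {"status": "stopped", "reason": "repeated action", "steps": len(history)}
        seen.add(key)
        tool = tools.get(action["tool"])
        if tool is None:
            observation = f"error: unknown tool {action['tool']!r}"
        else:
            observation = tool(action["input"])
        history.append((action, observation))
    return {"status": "stopped", "reason": "step limit", "steps": len(history)}


def scripted_policy(task: str, history: List[Step]) -> Action:
    """Stand-in for a model: look the task up once, then finish with the result."""
    if not history:
        return {"tool": "lookup", "input": task}
    return {"tool": "finish", "input": history[-1][1]}


def stuck_policy(task: str, history: List[Step]) -> Action:
    """Stand-in for a model that never makes progress."""
    return {"tool": "lookup", "input": task}


TOOLS = {"lookup": lambda q: {"capital of France": "Paris"}.get(q, "unknown")}

if __name__ == "__main__":
    print(run_agent(scripted_policy, TOOLS, "capital of France"))
    print(run_agent(stuck_policy, TOOLS, "capital of France"))
```

Both guards return a structured "stopped" result instead of raising, so the caller can fall back or escalate to a human. A production loop would also validate each observation and log the full trace.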

Judgment in one minute

The escalation ladder — code → single LLM call → RAG → agent → multi-agent. Each rung costs more (latency, money, unpredictability). Start at the bottom; climb only when the rung below provably can't do the job.
When NOT to use AI — deterministic, exact, or correctness-critical tasks (math, lookups, rules, anything needing the same answer every time) belong in code. An LLM is the wrong tool for problems with one right answer.
Hallucination is permanent — it's intrinsic to how LLMs work, not a bug awaiting a fix. Design around it (grounding, citation, verification), never assume it'll go away.
Human-in-the-loop — keep a person on consequential/irreversible actions. Confidence ≠ correctness, so gate high-stakes outputs with review.
Automation vs augmentation — automation replaces the human (needs high reliability + low blast radius); augmentation assists a human who stays in control. Default to augmentation when stakes are real.
Designing for failure — assume the model will be wrong, slow, or unavailable. Add fallbacks, timeouts, validation, and clear user-facing error/recovery paths.
Prompt injection — malicious instructions hidden in retrieved/user content hijack the model. Treat all external text as untrusted, separate instructions from data, and never give an agent unchecked authority over sensitive actions.
Model-agnostic engineering — abstract the provider behind an interface so you can swap models as price/quality shifts. Don't hard-couple your product to one model's quirks.
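
Model-agnostic engineering and designing for failure fit together in a few lines: hide each provider behind one interface, try providers in order, validate every reply, and return a safe default when nothing usable comes back. The classes below are stand-ins that simulate providers. No real SDK is called, and all names are hypothetical.

```python
from typing import Callable, Protocol, Sequence


class TextModel(Protocol):
    """The only surface the product depends on; each provider gets an adapter."""

    def complete(self, prompt: str) -> str: ...


class FlakyModel:
    """Simulates a provider that is down or timing out."""

    def complete(self, prompt: str) -> str:
        raise TimeoutError("simulated outage")


class CannedModel:
    """Simulates a provider that always returns a fixed reply."""

    def __init__(self, reply: str) -> None:
        self.reply = reply

    def complete(self, prompt: str) -> str:
        return self.reply


def complete_with_fallback(
    models: Sequence[TextModel],
    prompt: str,
    is_valid: Callable[[str], bool],
    default: str,
) -> str:
    """Return the first valid reply, or `default` if every model fails."""
    for model in models:
        try:
            reply = model.complete(prompt)
        except Exception:
            # Broad catch keeps the sketch short; real code would catch
            # the provider's specific error types and log the failure.
            continue
        if is_valid(reply):
            return reply
    return default


if __name__ == "__main__":
    answer = complete_with_fallback(
        [FlakyModel(), CannedModel("yes")],
        "Is the invoice overdue? Answer yes or no.",
        is_valid=lambda r: r in {"yes", "no"},
        default="unavailable",
    )
    print(answer)
```

Because callers only see `TextModel`, swapping or reordering providers is a configuration change rather than a rewrite, and the validation step means a confident but malformed reply is handled the same way as an outage.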

Decision quick-reference

Situation	Right technique
Answers must cite private / up-to-date documents	RAG (retrieve, ground, cite)
Need the same exact answer every time	Don't use an LLM — use code / deterministic logic
Output should match a fixed style, format, or behaviour	Fine-tuning (not RAG)
Subjective quality scored at scale	LLM-as-judge, validated against human labels first
Task is a known, fixed sequence of steps	Workflow (chaining / routing), not an agent
Open-ended, multi-step task with unknown path	Single agent with a tool loop + iteration cap
Keyword/exact terms (names, IDs) being missed	Hybrid search + reranking
Retrieval pulls the wrong or partial context	Query rewriting + contextual retrieval; check precision/recall
Multi-hop "how do these relate?" questions	GraphRAG
Irreversible or high-stakes action	Human-in-the-loop / augmentation, not automation
Untrusted external text feeding the prompt	Treat as data, defend against prompt injection
One agent overwhelmed by distinct roles/contexts	Multi-agent (only after single-agent fails)

Golden rules

Look at your data before you measure — error analysis precedes metrics; you can't fix what you haven't read.
An eval you don't run is worthless — wire it into the loop or it doesn't exist.
Measure the product, not the model — generic benchmarks don't tell you if your feature works.
Retrieve, then generate — never trust ungrounded output for facts.
A faithful answer over the wrong document is still wrong — fix retrieval, not just the prompt.
Prefer the simplest rung of the ladder that works — every step up costs reliability and money.
Validate your judge against humans before you believe its scores.
Hallucination is permanent — design around it, don't wait for a fix.
Optimize agents for pass^k (consistency), not pass@k (best case).
Assume the model will fail — build the fallback before you ship the feature.
