Revision Cheat Sheet
By Pritesh Yadav
This is the 10-minute review. Each pillar is distilled to its load-bearing ideas; the decision table tells you which technique fits which situation; the golden rules are what to remember when you forget everything else.
Evals in one minute
- Model eval vs product eval — model eval asks "is the model good?" (generic benchmarks); product eval asks "does my feature do its job for my users?" The second is the one that matters.
- Golden dataset — a curated, labelled set of representative inputs + expected behaviour; your ground truth. Build it from real usage and edge cases, not invented examples.
- Offline vs online — offline = measured pre-ship against fixed data; online = measured in production on live traffic (A/B, user signals). You need both: offline gates releases, online catches reality.
- Error analysis — read actual failing outputs, cluster them into failure modes, name them. This comes before metrics — you can't measure what you haven't looked at.
- LLM-as-a-judge — use a model to score outputs at scale. Cheap and fast, but must be validated against human labels before you trust it.
- Judge biases — position (favours first/last option), verbosity (favours longer answers), self-preference (favours its own model family). Mitigate: randomise order, control for length, use a different judge model.
- Binary vs Likert — binary (pass/fail) is reliable and actionable; Likert (1–5) is noisier because graders often disagree on where one score ends and the next begins. Prefer binary unless you genuinely need gradation.
- RAG-specific metrics — faithfulness (answer supported by retrieved context, no hallucination), context precision (retrieved chunks are relevant), context recall (all needed info was retrieved). Diagnose retrieval vs generation separately.
- pass@k vs pass^k — pass@k = succeeds at least once in k tries (measures capability/best-case); pass^k = succeeds in all k tries (measures reliability/consistency). Agents need pass^k thinking.
- The eval flywheel — ship → observe failures → add them to the golden set → improve → re-measure. The dataset compounds into a moat.
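The pass@k versus pass^k distinction above is easier to feel with numbers. The sketch below assumes each attempt succeeds independently with the same probability `p`, which is a simplification (real attempts are often correlated); the function names are illustrative, not from any library.

```python
def pass_at_k(p: float, k: int) -> float:
    """Chance that at least one of k independent attempts succeeds."""
    return 1.0 - (1.0 - p) ** k


def pass_hat_k(p: float, k: int) -> float:
    """Chance that all k independent attempts succeed (pass^k)."""
    return p ** k


if __name__ == "__main__":
    p = 0.8  # assumed per-attempt success rate, for illustration only
    for k in (1, 3, 5):
        print(f"k={k}  pass@k={pass_at_k(p, k):.3f}  pass^k={pass_hat_k(p, k):.3f}")
```

Under these assumptions an 80%-reliable agent looks nearly perfect on pass@3 (0.992) yet completes three runs in a row only about half the time (0.512), which is why the two metrics should not be read interchangeably.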
Retrieval in one minute
- Tokens / context window — models read/write in tokens (sub-word units); the context window is the finite token budget for prompt + output. Everything you retrieve competes for that space.
- Chunking — splitting documents into retrievable pieces. Too big = noisy/wasteful; too small = lost meaning. Chunk on semantic boundaries, keep some overlap.
- Embeddings — text mapped to vectors so semantic similarity = geometric closeness. The basis of meaning-based search.
- Vector DB / similarity search — stores embeddings and finds nearest neighbours to a query vector. Fast approximate semantic lookup.
- Hybrid search — combine semantic (vector) with keyword/lexical (BM25). Catches both meaning and exact terms (names, codes, IDs) that embeddings miss.
- Reranking — a second, more precise model reorders the top candidates from first-stage retrieval. Cheap retrieval casts a wide net; reranker tightens precision.
- RAG pipeline — retrieve relevant context → inject into prompt → generate a grounded answer. Grounds output in your data without retraining.
- Query rewriting — reshape the raw user query (expand, clarify, decompose) before retrieval so it matches how documents are written.
- Contextual retrieval — prepend chunk-level context (what document/section it came from) before embedding, so isolated chunks stay interpretable.
- GraphRAG — retrieve over a knowledge graph of entities/relationships, not just flat chunks. Better for multi-hop, "how do these connect?" questions.
- RAG vs fine-tuning vs long-context — RAG adds knowledge (facts, updatable, citable); fine-tuning teaches behaviour/format/style; long-context stuffs everything in the window (simple but costly, and answer quality degrades as the context grows). Default to RAG for facts.
- "Lost in the middle" — models attend best to the start and end of a long context and miss the middle. Put the most important retrieved material at the edges; don't assume more context = better.
- Agent memory — persisting state across turns/sessions (scratchpad, summaries, retrieved history) so an agent isn't amnesiac. Memory is retrieval applied to the agent's own past.
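One common way to combine the vector and keyword result lists in hybrid search is reciprocal rank fusion (RRF). The source does not name a fusion method, so treat this as one reasonable option rather than the prescribed one. The document IDs are made up, and `k=60` is a conventional default, not a tuned value.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge several ranked lists of document IDs into one ranking.

    Each document earns 1 / (k + rank) from every list it appears in,
    so items ranked well by both retrievers rise to the top.
    """
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Sort by score (highest first); break ties by ID for determinism.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Hypothetical first-stage results from the two retrievers.
vector_hits = ["doc_a", "doc_b", "doc_c"]
keyword_hits = ["doc_d", "doc_a", "doc_b"]

fused = reciprocal_rank_fusion([vector_hits, keyword_hits])
print(fused)
```

RRF only needs ranks, not comparable scores, which is convenient because BM25 scores and cosine similarities live on different scales. A reranker can then be applied to the fused top candidates.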
Agents in one minute
- Agent vs workflow — a workflow is a fixed, predefined path of LLM calls; an agent dynamically decides its own steps and tool use. Workflows are predictable; agents are flexible but harder to control. Prefer a workflow when the path is known.
- The agent loop — observe → think → act (call a tool) → observe result → repeat until done. The engine under every agent.
- Tools / function calling — the model emits a structured request to call a function; your code runs it and feeds the result back. How agents touch the real world.
- ReAct — interleave Reasoning and Acting ("thought → action → observation" cycles) so the model plans, acts, and adjusts on feedback.
- Workflow patterns — prompt chaining (sequential steps), routing (classify then dispatch), parallelization (fan out, aggregate), orchestrator-worker (a lead delegates subtasks), evaluator-optimizer (generate → critique → revise loop), reflection (model reviews its own output before finalizing).
- Single vs multi-agent — multi-agent adds coordination cost and new failure surface. Use one agent until a task genuinely needs separate roles/contexts; most don't.
- MCP (Model Context Protocol) — an open standard for connecting models to tools/data sources, so integrations are reusable instead of bespoke per app.
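- Failure modes — loops (agent repeats the same action forever), compounding errors (a small early mistake propagates and amplifies across steps). Per-step success rates multiply across a long autonomous chain: if steps fail independently at 95% success each, a 20-step chain succeeds end to end only about 36% of the time.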
- Reliability practices — cap iterations, validate every tool output, constrain the action space, add checkpoints/human approval for risky steps, log traces, and design for graceful failure. Build for pass^k, not pass@k.
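The agent loop and two of the reliability practices above (an iteration cap and a constrained action space) can be sketched in a few lines. Everything here is hypothetical scaffolding: `scripted_policy` stands in for a real model call, and the action dictionaries are an invented format, not any provider's function-calling schema.

```python
from typing import Callable


def add(a: float, b: float) -> float:
    return a + b


# The action space is constrained to an explicit allow-list of tools.
TOOLS: dict[str, Callable[..., float]] = {"add": add}


def scripted_policy(history: list[dict]) -> dict:
    """Stand-in for the LLM: picks the next action from what it has seen."""
    if not history:
        return {"tool": "add", "args": {"a": 2, "b": 3}}
    return {"final": f"The sum is {history[-1]['result']}"}


def run_agent(policy: Callable[[list[dict]], dict], max_steps: int = 5) -> str:
    history: list[dict] = []
    for _ in range(max_steps):  # iteration cap: the loop cannot run forever
        action = policy(history)  # think
        if "final" in action:
            return action["final"]
        tool = TOOLS.get(action.get("tool"))
        if tool is None:
            # Unknown tool: report the error back instead of crashing.
            result = "error: unknown tool"
        else:
            try:
                result = tool(**action.get("args", {}))  # act
            except TypeError as exc:  # malformed arguments from the model
                result = f"error: {exc}"
        history.append({"action": action, "result": result})  # observe
    return "stopped: iteration cap reached"


print(run_agent(scripted_policy))
```

A policy that never emits a final answer hits the cap and returns the stop message rather than looping, which is the graceful-failure behaviour the list asks for. A production loop would also log each step and gate risky tools behind human approval.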
Judgment in one minute
- The escalation ladder — code → single LLM call → RAG → agent → multi-agent. Each rung costs more (latency, money, unpredictability). Start at the bottom; climb only when the rung below provably can't do the job.
- When NOT to use AI — deterministic, exact, or correctness-critical tasks (math, lookups, rules, anything needing the same answer every time) belong in code. An LLM is the wrong tool for problems with one right answer.
- Hallucination is permanent — it's intrinsic to how LLMs work, not a bug awaiting a fix. Design around it (grounding, citation, verification), never assume it'll go away.
- Human-in-the-loop — keep a person on consequential/irreversible actions. Confidence ≠ correctness, so gate high-stakes outputs with review.
- Automation vs augmentation — automation replaces the human (needs high reliability + low blast radius); augmentation assists a human who stays in control. Default to augmentation when stakes are real.
- Designing for failure — assume the model will be wrong, slow, or unavailable. Add fallbacks, timeouts, validation, and clear user-facing error/recovery paths.
- Prompt injection — malicious instructions hidden in retrieved/user content hijack the model. Treat all external text as untrusted, separate instructions from data, and never give an agent unchecked authority over sensitive actions.
- Model-agnostic engineering — abstract the provider behind an interface so you can swap models as price/quality shifts. Don't hard-couple your product to one model's quirks.
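Designing for failure and model-agnostic engineering tend to meet in the same piece of code: a small interface that hides the provider, plus a fallback chain behind it. The sketch below is illustrative only; `TextModel`, `FlakyModel`, and `CannedModel` are invented names, and a real implementation would wrap actual provider clients and add timeouts.

```python
from typing import Protocol


class TextModel(Protocol):
    """Minimal provider-neutral interface the product code depends on."""

    def complete(self, prompt: str) -> str: ...


class FlakyModel:
    """Hypothetical primary provider that is currently failing."""

    def complete(self, prompt: str) -> str:
        raise TimeoutError("primary model timed out")


class CannedModel:
    """Hypothetical cheaper backup provider."""

    def complete(self, prompt: str) -> str:
        return f"[fallback] {prompt[:20]}"


def complete_with_fallback(models: list[TextModel], prompt: str) -> str:
    """Try each model in order; return the first non-empty answer."""
    for model in models:
        try:
            answer = model.complete(prompt)
        except Exception:
            # Broad catch is deliberate in this sketch: any provider
            # failure should move us to the next option.
            continue
        if answer.strip():  # basic validation of the output
            return answer
    # Clear user-facing recovery path when every model fails.
    return "Sorry, this feature is unavailable right now."


print(complete_with_fallback([FlakyModel(), CannedModel()], "hello"))
```

Because callers only see `complete`, swapping or reordering providers is a change to the list passed in, not to the product code.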
Decision quick-reference
| Situation | Right technique |
|---|---|
| Answers must cite private / up-to-date documents | RAG (retrieve, ground, cite) |
| Need the same exact answer every time | Don't use an LLM — use code / deterministic logic |
| Output should match a fixed style, format, or behaviour | Fine-tuning (not RAG) |
| Subjective quality scored at scale | LLM-as-judge, validated against human labels first |
| Task is a known, fixed sequence of steps | Workflow (chaining / routing), not an agent |
| Open-ended, multi-step task with unknown path | Single agent with a tool loop + iteration cap |
| Keyword/exact terms (names, IDs) being missed | Hybrid search + reranking |
| Retrieval pulls the wrong or partial context | Query rewriting + contextual retrieval; check precision/recall |
| Multi-hop "how do these relate?" questions | GraphRAG |
| Irreversible or high-stakes action | Human-in-the-loop / augmentation, not automation |
| Untrusted external text feeding the prompt | Treat as data, defend against prompt injection |
| One agent overwhelmed by distinct roles/contexts | Multi-agent (only after single-agent fails) |
Golden rules
- Look at your data before you measure — error analysis precedes metrics; you can't fix what you haven't read.
- An eval you don't run is worthless — wire it into the loop or it doesn't exist.
- Measure the product, not the model — generic benchmarks don't tell you if your feature works.
- Retrieve, then generate — never trust ungrounded output for facts.
- A faithful answer over the wrong document is still wrong — fix retrieval, not just the prompt.
- Prefer the simplest rung of the ladder that works — every step up costs reliability and money.
- Validate your judge against humans before you believe its scores.
- Hallucination is permanent — design around it, don't wait for a fix.
- Optimize agents for pass^k (consistency), not pass@k (best case).
- Assume the model will fail — build the fallback before you ship the feature.