Revision Cheat Sheet

By Pritesh Yadav · 8 min read

This is the 10-minute review. Each pillar distilled to its load-bearing ideas; the decision table tells you which technique fits which situation; the golden rules are what to remember when you forget everything else.

Evals in one minute

  • Model eval vs product eval — model eval asks "is the model good?" (generic benchmarks); product eval asks "does my feature do its job for my users?" The second is the one that matters.
  • Golden dataset — a curated, labelled set of representative inputs + expected behaviour; your ground truth. Build it from real usage and edge cases, not invented examples.
  • Offline vs online — offline = measured pre-ship against fixed data; online = measured in production on live traffic (A/B, user signals). You need both: offline gates releases, online catches reality.
  • Error analysis — read actual failing outputs, cluster them into failure modes, name them. This comes before metrics — you can't measure what you haven't looked at.
  • LLM-as-a-judge — use a model to score outputs at scale. Cheap and fast, but must be validated against human labels before you trust it.
  • Judge biases: position (favours first/last option), verbosity (favours longer answers), self-preference (favours its own model family). Mitigate: randomise order, control for length, use a different judge model.
  • Binary vs Likert — binary (pass/fail) is reliable and actionable; Likert (1–5) is noisier and graders disagree. Prefer binary unless you genuinely need gradation.
  • RAG-specific metrics: faithfulness (answer supported by retrieved context, no hallucination), context precision (retrieved chunks are relevant), context recall (all needed info was retrieved). Diagnose retrieval vs generation separately.
  • pass@k vs pass^k — pass@k = succeeds at least once in k tries (measures capability/best-case); pass^k = succeeds in all k tries (measures reliability/consistency). Agents need pass^k thinking.
  • The eval flywheel — ship → observe failures → add them to the golden set → improve → re-measure. The dataset compounds into a moat.
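
The pass@k vs pass^k distinction is easier to remember with numbers. Below is a minimal sketch, assuming you ran the same task n times and saw c successes; the function names `pass_at_k` and `pass_hat_k` are made up for this illustration. Both are simple counting estimates over "draw k of the n recorded trials without replacement".

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimated chance that at least one of k sampled trials succeeds."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    # 1 minus the chance that all k sampled trials are failures.
    return 1.0 - comb(n - c, k) / comb(n, k)

def pass_hat_k(n: int, c: int, k: int) -> float:
    """Estimated chance that all k sampled trials succeed (pass^k)."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    # Number of ways to pick k successes over number of ways to pick any k trials.
    return comb(c, k) / comb(n, k)

# Illustrative numbers only: 7 successes out of 10 runs, k = 3.
print(f"pass@3 = {pass_at_k(10, 7, 3):.3f}")   # pass@3 = 0.992
print(f"pass^3 = {pass_hat_k(10, 7, 3):.3f}")  # pass^3 = 0.292
```

The same 70% per-run success rate looks near-perfect under pass@3 and poor under pass^3, which is why an agent that must succeed every time should be judged on the second number.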

Retrieval in one minute

  • Tokens / context window — models read/write in tokens (sub-word units); the context window is the finite token budget for prompt + output. Everything you retrieve competes for that space.
  • Chunking — splitting documents into retrievable pieces. Too big = noisy/wasteful; too small = lost meaning. Chunk on semantic boundaries, keep some overlap.
  • Embeddings — text mapped to vectors so semantic similarity = geometric closeness. The basis of meaning-based search.
  • Vector DB / similarity search — stores embeddings and finds nearest neighbours to a query vector. Fast approximate semantic lookup.
  • Hybrid search — combine semantic (vector) with keyword/lexical (BM25). Catches both meaning and exact terms (names, codes, IDs) that embeddings miss.
  • Reranking — a second, more precise model reorders the top candidates from first-stage retrieval. Cheap retrieval casts a wide net; reranker tightens precision.
  • RAG pipeline — retrieve relevant context → inject into prompt → generate a grounded answer. Grounds output in your data without retraining.
  • Query rewriting — reshape the raw user query (expand, clarify, decompose) before retrieval so it matches how documents are written.
  • Contextual retrieval — prepend chunk-level context (what document/section it came from) before embedding, so isolated chunks stay interpretable.
  • GraphRAG — retrieve over a knowledge graph of entities/relationships, not just flat chunks. Better for multi-hop, "how do these connect?" questions.
  • RAG vs fine-tuning vs long-context — RAG adds knowledge (facts, updatable, citable); fine-tuning teaches behaviour/format/style; long-context stuffs everything in the window (simple but costly, and degrades). Default to RAG for facts.
  • "Lost in the middle" — models attend best to the start and end of a long context and miss the middle. Put the most important retrieved material at the edges; don't assume more context = better.
  • Agent memory — persisting state across turns/sessions (scratchpad, summaries, retrieved history) so an agent isn't amnesiac. Memory is retrieval applied to the agent's own past.
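
One common way to combine vector and keyword results for hybrid search is reciprocal rank fusion (RRF). The sketch below is one option rather than the only way to merge the two lists; the document IDs are invented, and `k = 60` is a conventional default for the smoothing constant, not a tuned value.

```python
from collections import defaultdict

def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge several ranked lists of document IDs into one fused ranking."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            # A document earns more credit the higher it ranks in each list.
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by ID so the output is stable.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))

# Hypothetical results from the two first-stage retrievers.
vector_hits = ["doc_a", "doc_b", "doc_c"]   # semantic (embedding) search
keyword_hits = ["doc_c", "doc_a", "doc_d"]  # lexical (BM25-style) search

print(reciprocal_rank_fusion([vector_hits, keyword_hits]))
# ['doc_a', 'doc_c', 'doc_b', 'doc_d']
```

Documents that both retrievers agree on rise to the top, while a document only one retriever found (such as an exact ID match from keyword search) still makes the candidate list for the reranker.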

Agents in one minute

  • Agent vs workflow — a workflow is a fixed, predefined path of LLM calls; an agent dynamically decides its own steps and tool use. Workflows are predictable; agents are flexible but harder to control. Prefer a workflow when the path is known.
  • The agent loop — observe → think → act (call a tool) → observe result → repeat until done. The engine under every agent.
  • Tools / function calling — the model emits a structured request to call a function; your code runs it and feeds the result back. How agents touch the real world.
  • ReAct — interleave Reasoning and Acting ("thought → action → observation" cycles) so the model plans, acts, and adjusts on feedback.
  • Workflow patterns: prompt chaining (sequential steps), routing (classify then dispatch), parallelization (fan out, aggregate), orchestrator-worker (a lead delegates subtasks), evaluator-optimizer (generate → critique → revise loop), reflection (model reviews its own output before finalizing).
  • Single vs multi-agent — multi-agent adds coordination cost and new failure surface. Use one agent until a task genuinely needs separate roles/contexts; most don't.
  • MCP (Model Context Protocol) — an open standard for connecting models to tools/data sources, so integrations are reusable instead of bespoke per app.
  • Failure modes: loops (agent repeats the same action forever), compounding errors (a small early mistake propagates and amplifies across steps). Long autonomous chains multiply per-step error rates.
  • Reliability practices — cap iterations, validate every tool output, constrain the action space, add checkpoints/human approval for risky steps, log traces, and design for graceful failure. Build for pass^k, not pass@k.
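
The agent loop and a few of the reliability practices above (iteration cap, constrained action space, tool-output validation, loop detection) can be sketched in a few lines. Everything here is hypothetical: `scripted_model` stands in for a real LLM call, and `TOOLS` is a toy registry with a single made-up tool.

```python
from typing import Callable

# Hypothetical tool registry: the only actions the agent is allowed to take.
TOOLS: dict[str, Callable[[str], str]] = {
    "lookup_price": lambda item: {"apple": "0.50", "pear": "0.75"}.get(item, ""),
}

def scripted_model(history: list[str]) -> dict:
    """Stand-in for an LLM: look the price up once, then finish with it."""
    if not any(line.startswith("observation: ") for line in history):
        return {"action": "lookup_price", "input": "apple"}
    return {"action": "finish", "input": history[-1].removeprefix("observation: ")}

def run_agent(task: str, model: Callable[[list[str]], dict], max_steps: int = 5) -> str:
    history = [f"task: {task}"]
    seen: set[tuple[str, str]] = set()
    for _ in range(max_steps):                      # cap iterations
        decision = model(history)                   # think
        action = decision.get("action", "")
        arg = decision.get("input", "")
        if action == "finish":
            return arg
        if action not in TOOLS:                     # constrain the action space
            history.append(f"observation: error, unknown tool {action!r}")
            continue
        if (action, arg) in seen:                   # detect a repeated action
            return "stopped: repeated action"
        seen.add((action, arg))
        result = TOOLS[action](arg)                 # act
        if not result:                              # validate the tool output
            result = "error, tool returned nothing"
        history.append(f"observation: {result}")    # observe
    return "stopped: step limit reached"

print(run_agent("price of an apple", scripted_model))  # 0.50
```

A real agent would also log the full trace and route risky actions through human approval; the point of the sketch is that every exit path is explicit, so a confused model stops instead of spinning.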

Judgment in one minute

  • The escalation ladder — code → single LLM call → RAG → agent → multi-agent. Each rung costs more (latency, money, unpredictability). Start at the bottom; climb only when the rung below provably can't do the job.
  • When NOT to use AI — deterministic, exact, or correctness-critical tasks (math, lookups, rules, anything needing the same answer every time) belong in code. An LLM is the wrong tool for problems with one right answer.
  • Hallucination is permanent — it's intrinsic to how LLMs work, not a bug awaiting a fix. Design around it (grounding, citation, verification), never assume it'll go away.
  • Human-in-the-loop — keep a person on consequential/irreversible actions. Confidence ≠ correctness, so gate high-stakes outputs with review.
  • Automation vs augmentation — automation replaces the human (needs high reliability + low blast radius); augmentation assists a human who stays in control. Default to augmentation when stakes are real.
  • Designing for failure — assume the model will be wrong, slow, or unavailable. Add fallbacks, timeouts, validation, and clear user-facing error/recovery paths.
  • Prompt injection — malicious instructions hidden in retrieved/user content hijack the model. Treat all external text as untrusted, separate instructions from data, and never give an agent unchecked authority over sensitive actions.
  • Model-agnostic engineering — abstract the provider behind an interface so you can swap models as price/quality shifts. Don't hard-couple your product to one model's quirks.
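
Model-agnostic engineering and designing for failure fit together naturally: put every provider behind one small interface, then try them in order with validation and a user-facing fallback. A minimal sketch, assuming hypothetical `FlakyModel` and `EchoModel` classes in place of real provider clients:

```python
from typing import Callable, Protocol

class TextModel(Protocol):
    """The only surface the product depends on; any provider can implement it."""
    def complete(self, prompt: str) -> str: ...

class FlakyModel:
    """Hypothetical primary provider that is currently timing out."""
    def complete(self, prompt: str) -> str:
        raise TimeoutError("provider timed out")

class EchoModel:
    """Hypothetical backup provider; a real one would call another API."""
    def complete(self, prompt: str) -> str:
        return f"echo: {prompt}"

FALLBACK_MESSAGE = "Sorry, I can't answer that right now."

def complete_with_fallback(
    models: list[TextModel],
    prompt: str,
    validate: Callable[[str], bool] = lambda text: bool(text.strip()),
) -> str:
    """Try each model in order; return the first answer that passes validation."""
    for model in models:
        try:
            answer = model.complete(prompt)
        except Exception:
            continue  # wrong, slow, or unavailable: move on to the next provider
        if validate(answer):
            return answer
    return FALLBACK_MESSAGE  # clear recovery path instead of a crash

print(complete_with_fallback([FlakyModel(), EchoModel()], "hello"))  # echo: hello
```

Because the calling code only knows about `TextModel`, swapping or reordering providers is a one-line change, and the fallback exists before the feature ships.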

Decision quick-reference

Situation | Right technique
Answers must cite private / up-to-date documents | RAG (retrieve, ground, cite)
Need the same exact answer every time | Don't use an LLM; use code / deterministic logic
Output should match a fixed style, format, or behaviour | Fine-tuning (not RAG)
Subjective quality scored at scale | LLM-as-judge, validated against human labels first
Task is a known, fixed sequence of steps | Workflow (chaining / routing), not an agent
Open-ended, multi-step task with unknown path | Single agent with a tool loop + iteration cap
Keyword/exact terms (names, IDs) being missed | Hybrid search + reranking
Retrieval pulls the wrong or partial context | Query rewriting + contextual retrieval; check precision/recall
Multi-hop "how do these relate?" questions | GraphRAG
Irreversible or high-stakes action | Human-in-the-loop / augmentation, not automation
Untrusted external text feeding the prompt | Treat as data, defend against prompt injection
One agent overwhelmed by distinct roles/contexts | Multi-agent (only after single-agent fails)

Golden rules

  • Look at your data before you measure — error analysis precedes metrics; you can't fix what you haven't read.
  • An eval you don't run is worthless — wire it into the loop or it doesn't exist.
  • Measure the product, not the model — generic benchmarks don't tell you if your feature works.
  • Retrieve, then generate — never trust ungrounded output for facts.
  • A faithful answer over the wrong document is still wrong — fix retrieval, not just the prompt.
  • Prefer the simplest rung of the ladder that works — every step up costs reliability and money.
  • Validate your judge against humans before you believe its scores.
  • Hallucination is permanent — design around it, don't wait for a fix.
  • Optimize agents for pass^k (consistency), not pass@k (best case).
  • Assume the model will fail — build the fallback before you ship the feature.
