Evaluation & Measurement

By Pritesh Yadav 18 min read

Imagine you change one sentence in the instructions you give an AI chatbot. Did the chatbot get better, or worse? With normal software you could run your tests and get a clear answer. With AI, the same input can produce different outputs every time, and there is rarely one "correct" answer — so how do you know? This is the question that evaluation (or evals for short) exists to answer.

An eval is a systematic, repeatable measurement of how good an AI system's outputs are, judged against criteria you define in advance. It is the AI-era equivalent of a software test suite — except instead of "assert that the result equals 42," you grade probabilistic, open-ended outputs. The practitioners who built real LLM (Large Language Model — an AI trained to produce text) products almost all converge on one lesson: the teams that succeed build a robust evaluation system, and the teams that fail collapse into endless "whack-a-mole" debugging because they have no objective way to tell whether a change helped or hurt.

What evals are and why they matter

Without evals, you are flying blind. You "eyeball" a handful of outputs, they look fine, you ship — and then a customer hits an edge case you never checked. You fix that, and unknowingly break two other things. This is the balloon-squeezing trap that consumes teams who measure by gut feeling.

Analogy: Evals are to an LLM app what a test suite is to ordinary software — except the "assert equals" is replaced by graded, probabilistic judgments, because there is rarely a single exact right answer. Just as you would not ship a refactor without running your tests, you should not ship a prompt change without running your eval set.

A crucial early distinction: model evals versus product evals. A model eval (also called a benchmark) measures a model's general capability in isolation — standardized tests like MMLU (general knowledge) or GSM8K (grade-school math). A product/task eval measures whether your specific system does your specific job — for example, "does the support bot extract the correct order number 95% of the time?" The hard truth, documented by practitioners like Eugene Yan, is that leaderboard scores barely predict your application's real-world performance. A model that tops a benchmark can still flunk your extraction task.

Analogy: Model evals vs product evals is like a chef's culinary-school grade vs whether THIS chef makes good food at YOUR restaurant for YOUR customers. A high grade does not guarantee your diners are happy.
Key takeaway: Evals replace gut feeling with a repeatable measurement you run on every change. Benchmark scores rank models in general; only your own task-specific evals tell you whether your product works.

The foundation: define "good," then look at your data

You cannot measure quality until you have defined it. Anthropic's guidance is to write success criteria that are Specific, Measurable, Achievable, and Relevant — not "classify sentiment well," but "F1 ≥ 0.85 on a held-out set of 10,000 diverse tweets, a 5% improvement over baseline." (F1 is a single score that balances catching real positives against avoiding false alarms.) Most real systems need multidimensional criteria — correctness AND tone AND format AND safety AND latency AND cost — because a single number hides trade-offs.

To grade against those criteria you need ground truth: a golden dataset of inputs paired with their correct expected outputs, labeled by a trusted source (usually a domain expert). This is your answer key. Most eval failures trace back to a missing, tiny, or unrepresentative golden set.

Analogy: A golden dataset is the answer key to an exam. Without it you can grade vibes but never assign an objective score, and two graders will disagree forever.

But the single highest-ROI activity is more mundane than any of this: look at your data. Practitioners report spending 60–80% of development time reading real traces (a trace is the full record of one request: the input, any retrieved documents, tool calls, intermediate steps, and the final output) and doing error analysis — categorizing failures by root cause. Generic off-the-shelf metrics like "helpfulness" or "coherence" are a trap; they create an illusion of confidence without telling you what to fix. Real evals are built from the specific failure modes you discover by reading your own traces.

The disciplined version of error analysis borrows from qualitative research:

  1. Open coding: one expert reads traces and writes free-form notes on what went wrong — no fixed schema. Crucially, log only the first failure in each trace, because an upstream error (bad retrieval) cascades into misleading downstream symptoms (a confused, rude answer).
  2. Axial coding: cluster those notes into a small taxonomy of named categories, then count how often each occurs (e.g. "retrieval miss 40%, ignored constraint 30%, hallucinated policy 20%").
  3. Stop at theoretical saturation: when ~20 traces in a row reveal no new failure category (typically after reviewing ~100), you have sampled enough.

The counts tell you where to spend effort. And often the fix is trivial: if the model was simply never told to do something, edit the instructions — do not build an elaborate eval for a one-line gap.

Common mistake: Reaching for tools, frameworks, or an LLM-judge before looking at your data. You end up measuring abstract qualities that do not matter and never see your real failure modes. Look at the data first; the metric is downstream of the diagnosis.
Key takeaway: Define "good" up front in Specific/Measurable terms, build a representative golden dataset, and spend most of your effort reading real traces and bucketing failures by root cause — that error analysis decides which evals are even worth writing.

The three grader families — pick the cheapest reliable one

Once you know what to measure, you need a way to grade each output. There are three families, ordered from cheapest/most reliable to most expensive/most flexible. The golden rule: use the cheapest grader that is reliable enough for the criterion, and layer them.

Grader familyHow it worksStrengthsWeaknessesUse for
Code-based / deterministicPlain code: exact match, regex, JSON-schema validation, assertions ("contains the SKU?", "valid JSON?", "returns N results?")Fast, free, reproducible, runs on every commitRigid — only works when correctness is mechanically checkableStructure, format, IDs, counts, tool-call correctness
Reference-based metricsCompare output to a gold answer: exact match, BLEU/ROUGE (word overlap), embedding similarity (BERTScore)Captures semantic similarity (BERTScore); needs only a referencen-gram metrics penalize valid paraphrases ("car" vs "automobile"); none judge factuality or task successClosed-form answers, "are two answers semantically the same?"
LLM-as-a-judge / humanAn LLM (or a person) grades output against a rubricHandles nuance, tone, open-ended quality; scalable (LLM)Expensive, non-deterministic, biased; must be validated against humansSubjective quality, faithfulness, tone, coherence

Anthropic puts it bluntly: code-based grading is "by far the best grading method if you can design an eval that allows for it." Wherever output structure permits, use a deterministic check — they are the backbone of CI (Continuous Integration — automated checks that run on every code change) eval suites because they are cheap enough to run constantly.

A note on the middle tier: BLEU and ROUGE measure how many word sequences a candidate shares with a reference. They are cheap but surface-level — they wrongly penalize "feline" when the reference says "cat," and reward overlap even when the meaning is wrong. BERTScore embeds words with a transformer (it turns each word into a list of numbers that captures meaning in context) and compares them by similarity, correlating much better with human judgment — but it still cannot tell you whether an answer is factually true or followed instructions.

Key takeaway: Climb the cost ladder — deterministic code first, reference-based next, LLM-judge/human only for genuinely subjective qualities. Layer them: a code check for structure plus an LLM judge for tone.

LLM-as-a-judge — powerful, but validate it first

For scoring open-ended output at scale, LLM-as-a-judge — using one LLM to grade another's output against a rubric — has become the dominant technique. But it is not free truth. An unvalidated judge is just another untested model output that happens to produce a number that feels objective while being meaningless.

Analogy: An LLM judge is like a teaching assistant grading essays. You must give the TA a clear rubric and answer key (the prompt + few-shot examples), then spot-check the TA's grades against the professor's (validate against human labels). An ungraded, uncalibrated TA's scores tell you nothing.

Design principles for judges

  • Prefer binary pass/fail over 1–5 scales. As Hamel Husain puts it, "if your evaluations consist of a bunch of metrics that LLMs score on a 1–5 scale, you're doing it wrong." Nobody agrees on what makes a 3 versus a 4; binary forces you to define what actually matters and is far more consistent and actionable.
  • Reason first, then verdict. Ask the judge to think through its reasoning in <thinking> tags, then emit correct/incorrect in a <result> tag — and parse only the result. Reasoning-before-labeling measurably improves accuracy; the reasoning is discarded.
  • Pairwise beats absolute for ranking. Asking "which of A or B is better?" is easier and more reliable (often >80% human agreement) than asking the judge to assign each an absolute score.
  • Use a different model to judge than the one that generated the output, where feasible.

Validating the judge against humans

Here is the workflow practitioners follow: appoint a single principal domain expert (a "benevolent dictator" who owns the definition of "good"). They make binary pass/fail judgments with written critiques on a labeled sample. You build the judge prompt using those critiques as few-shot examples, run it on the same items, and measure agreement — but not with raw accuracy. On imbalanced data (say 95% pass), a judge that always says "pass" scores 95% while being useless. Instead measure True Positive Rate (how often it catches real failures) and True Negative Rate (how often it correctly passes good output) separately, plus Cohen's kappa for chance-corrected agreement. Iterate the prompt until TPR/TNR are high, then lock a held-out test set you touch only once.

Judge biases to neutralize

Judges have systematic biases: position bias (favoring whichever answer comes first or last), verbosity bias (preferring longer answers even when wrong), and self-preference bias (favoring text from its own model family). The standard fix for position bias in pairwise comparison is to run both orderings (A,B and B,A) and accept only a verdict that is consistent both ways.

v1 = judge(prompt, A, B)   # ask which is better
v2 = judge(prompt, B, A)   # swap the order, ask again
verdict = "A"   if v1=="A" and v2=="A"
          "B"   if v1=="B" and v2=="B"
          "tie" otherwise   # inconsistent = position bias, discard
Common mistake: Trusting an LLM judge without validating it against human labels, and using raw agreement on imbalanced data. An unvalidated, uncalibrated judge gives you a confident number that is meaningless — "who validates the validators?"
Key takeaway: An LLM judge is a model that must itself be evaluated. Use binary verdicts with reasoning-first prompts, validate against a domain expert via TPR/TNR (not raw accuracy), and neutralize position/verbosity/self-preference bias before you trust the scores.

Specialized evals: RAG and agents

Evaluating RAG (Retrieval-Augmented Generation)

RAG systems fetch relevant documents and then generate an answer from them. Failure can be in retrieval (wrong documents fetched) OR generation (good documents, bad answer), so you must evaluate them separately. The RAGAS framework defines the canonical metrics:

  • Faithfulness: the fraction of the answer's atomic claims that are supported by the retrieved context. Decompose the answer into claims, check each against the context with entailment, and score = supported ÷ total.
  • Answer relevancy: does the answer actually address the question?
  • Context precision: an order-aware measure (average precision) of whether retrieved chunks are relevant and ranked with the useful ones near the top.
  • Context recall: did retrieval bring back everything the reference answer needs? (Requires a ground-truth reference.)

The central trap: faithfulness is not truth. If the retriever returns a wrong document and the model faithfully repeats it, faithfulness is high but the answer is wrong.

Analogy: RAG faithfulness is like a court witness who faithfully repeats what they were told — they can be perfectly faithful to their source and still completely wrong, because the source was wrong. Grounding to context is not the same as being true.

Evaluating agents (multi-step systems)

An agent plans, calls tools, and acts over several steps. Evaluate at three levels: end-to-end outcome (did the task succeed?), trajectory/process (was the path of tool calls sound and efficient?), and component (which tool or sub-step broke?). Key metrics include task completion, tool correctness (right tool, right arguments), step efficiency, latency, and cost.

Anthropic's advice: grade the end-state, not the exact path. "Agents regularly find valid approaches that eval designers didn't anticipate." Reserve strict trajectory matching for steps that truly must happen in order — e.g. a banking agent must call verify_identity before transfer_funds.

Analogy: Grading an agent by its end-state instead of its exact tool sequence is like grading a maze runner by whether they reached the exit, not whether they took the one route you imagined. There are many valid paths; only a few mazes truly require a specific order.
Key takeaway: For RAG, score retrieval and generation separately and remember faithfulness ≠ correctness. For agents, grade the outcome/end-state primarily, add component checks to localize failures, and use strict trajectory matching only where order genuinely matters.

Offline and online: the full evaluation loop

Two regimes work together. Offline evaluation runs against a fixed golden set before you ship (your regression test in CI) — it catches problems you introduce. Online evaluation samples real production traffic after shipping — it catches problems that happen to you: a model provider silently updating their model, the input distribution shifting (drift), and novel user behavior. Voiceflow's offline harness once caught a ~10% accuracy drop on a routine model version upgrade that no benchmark would have surfaced.

Analogy: Offline vs online evaluation is like crash-testing a car in the lab (controlled, before release) vs telemetry from cars on the road (real conditions, after release). The lab cannot anticipate every pothole; you need both.

One important distinction: a guardrail is not an eval. A guardrail runs synchronously and inline (a millisecond budget) to block a specific bad output — a PII leak, a prompt injection, off-policy content — before it reaches the user. An eval runs asynchronously, after the fact, to measure quality. A guardrail is a smoke detector; an eval is a safety inspection.

                  THE EVALUATION FLYWHEEL
   ┌────────────────────────────────────────────────────────┐
   │                                                        │
   │   ① DEFINE success criteria (Specific, Measurable)     │
   │              │                                         │
   │              ▼                                         │
   │   ② BUILD golden dataset  ◄────────────┐               │
   │      (real + edge cases)               │ new failures  │
   │              │                         │ become tests  │
   │              ▼                         │               │
   │   ③ ERROR ANALYSIS (read traces,       │               │
   │      bucket failures by root cause)    │               │
   │              │                         │               │
   │              ▼                         │               │
   │   ④ PICK grader per criterion          │               │
   │      code > reference > LLM-judge      │               │
   │              │                         │               │
   │   ┌──────────┴──────────┐              │               │
   │   ▼                     ▼              │               │
   │  OFFLINE eval        validate judge    │               │
   │  (CI, regression)    vs human (TPR/TNR)│               │
   │   │                                    │               │
   │   ▼  ship behind canary / A-B test     │               │
   │  ⑤ ONLINE eval ───────────────────────┘               │
   │     (sample live traffic, watch guardrails:            │
   │      cost, latency, toxicity; auto-rollback)           │
   └────────────────────────────────────────────────────────┘

Statistics: an eval score is not a fact

Eval scores are estimates from a sample, so report uncertainty. Give a standard error and a 95% confidence interval (mean ± 1.96 × SEM). If Model A scores 72% and Model B 70% on 200 questions with overlapping intervals, the 2-point gap is noise. Use a paired test when comparing two models on the same questions (they tend to fail the same items, which cuts variance), and clustered standard errors when questions are grouped (many questions per document — naive errors can be 3× too small).

Analogy: Reporting an eval score without a confidence interval is like reporting an election poll without a margin of error — a "2-point lead" inside a ±3-point band is not a lead at all.

Reliability: pass@k vs pass^k

For non-deterministic agents, choose the right reliability metric. pass@k = succeeds at least once in k tries (good when any single success counts). pass^k = succeeds on all k tries (the right bar for customer-facing reliability). They diverge wildly: an agent that solves a task 6 times out of 10 has pass@10 ≈ 99.99% but pass^10 ≈ 0.6%. Reporting the flattering pass@k for a "works every time" claim is a classic error.

Best practice: Put a curated eval suite in CI that fails the build below a pass-rate threshold (e.g. <95%). Cache model responses keyed by input so unchanged tests cost nothing, run a cheap model on PRs and the full model nightly, and feed every new production failure back in as a regression test case.
Common mistake: Assuming green offline CI means production is healthy. Silent provider model updates, input drift, and novel user behavior never show up in a frozen test set — you also need online monitoring on sampled live traffic.
Key takeaway: Offline evals catch regressions you introduce; online evals catch the world changing under you. Report uncertainty (confidence intervals, paired tests), use pass^k for customer-facing reliability, and keep guardrails architecturally separate from quality evals.

Common mistakes and best practices, consolidated

Common mistake: Treating benchmark/leaderboard scores as if they predict your product's quality; skipping cheap Level-1 assertions; using 1–5 Likert scores everywhere; optimizing a goal metric while a guardrail (cost, latency, toxicity) silently regresses; and building evals too late, so you reverse-engineer criteria from messy live behavior instead of deriving clean test cases from requirements.
Best practice: Define success criteria first; build a task-specific golden dataset and grow it from real failures; prioritize volume of auto-gradable cases over a few hand-graded gems; appoint one principal expert to own "good"; invest in frictionless data-viewing tooling so error analysis actually happens; log full traces from day one (you cannot evaluate what you did not capture); and treat evals as a continuous flywheel, not a one-off.

Real-world cases. Rechat's "Lucy" real-estate assistant used tiered evals — Level-1 assertions like "never expose an internal UUID" that doubled as inference-time retry triggers — to escape whack-a-mole debugging. Honeycomb iterated a query-correctness judge to >90% agreement with its domain expert in three rounds, then used it to gate changes. Anthropic's Claude.ai web-search feature was tuned with balanced evals covering queries that should and should not trigger a search, preventing both under- and over-triggering. Medical assistants use binary LLM-judge evals to catch implicit PHI leaks ("the same medication as his father") that regex filters miss. And production platforms (Arize, Langfuse, LangSmith, Phoenix) ingest OpenTelemetry traces to sample live traffic, score it asynchronously, and auto-roll-back on guardrail breaches.

Key takeaway: The teams that win treat evals as core development work — most of their effort is reading data and doing error analysis, not building infrastructure — and they turn every production failure into a permanent regression test.

Section summary

  • Evals are systematic measurement, not vibe checks. They replace gut feeling with a repeatable score you can run on every change — the difference between teams that succeed with LLM products and teams stuck in whack-a-mole debugging.
  • Benchmark scores ≠ product quality. Model evals rank general capability; only your own task-specific evals on realistic inputs predict whether your system does your job well.
  • Looking at your data is the highest-ROI activity. Error analysis — reading traces and bucketing failures by root cause — decides what to fix and what to even measure. Generic metrics create false confidence.
  • Pick the cheapest reliable grader: deterministic code > reference-based metrics > LLM-as-a-judge/human. Layer them, and prefer binary pass/fail over fuzzy 1–5 scales.
  • An LLM judge must itself be validated against a human expert (via TPR/TNR, not raw accuracy) and de-biased for position, verbosity, and self-preference before you trust its scores.
  • RAG and agents need specialized evals: score retrieval and generation separately (faithfulness ≠ truth); grade an agent's end-state over its exact path except where order is mandatory.
  • Run both offline and online evals, report uncertainty (confidence intervals, paired tests, pass^k for reliability), keep guardrails separate, and close the flywheel by feeding every production failure back into the suite.
  • Define "good" up front (Evaluation-Driven Development). Building the eval before the feature makes quality measurable from the start and far easier than reverse-engineering criteria after launch.

Continue reading