AI Product Judgment

By Pritesh Yadav 18 min read

Most failed AI products do not fail because the model was weak. They fail because someone started with the wrong question. They asked "what cool thing can we build with AI?" instead of "what problem are we solving, and what is the simplest thing that solves it?" This chapter is about that judgment — the discipline of deciding whether to use AI at all, and if so, how much. It is the most durable skill in AI engineering because it survives every model release, framework, and pricing change. A clever prompt trick for one model version is obsolete in six months; the judgment to know that a task did not need a model at all is forever.

An LLM (large language model — a system trained to predict the next word in text, which lets it understand and generate human language) is genuinely powerful. But power tempts misuse. The famous warning applies: "when you have a hammer, everything looks like a nail." Good product judgment is the ability to look at a problem and recognize which parts are nails (fuzzy, language-shaped) and which are screws (exact, rule-shaped) that need an entirely different tool.

What It Is and Why It Matters

AI product judgment is the discipline of matching the tool to the task. The core insight is that the world is full of deterministic problems — problems with one correct, checkable answer (validation, formatting, arithmetic, lookups, fixed business rules). Plain code solves these perfectly: same input, same output, every time, for near-zero cost, with no chance of making something up. LLMs are probabilistic — they produce a likely answer that can vary between runs and can be confidently wrong. As one practitioner puts it: "Business is deterministic by design; generative AI is probabilistic by nature."

So the first job is not building — it is sorting. Use AI for judgment, not arithmetic. LLMs shine on tasks that are ambiguous, open-ended, or require interpretation: summarizing a messy report, classifying support tickets written in free text, extracting structured fields from a sloppy PDF, answering questions in natural language. They are the wrong tool for tasks a rules engine, a SQL query, or a bit of math could do instantly and reliably. Calculating a customer's debt breakdown with a generative model has been described as "a grave architectural error" — it is arithmetic with one right answer, so use code.

Why this matters so much: every LLM call carries three hidden taxes. A cost penalty (you pay per token, and that multiplies across every user), a latency penalty (a network round-trip of 300 milliseconds to several seconds, where local code finishes in microseconds), and a reliability penalty (non-determinism and hallucination, producing bugs that are hard to reproduce). When you reach for AI on a deterministic task, you pay all three taxes for zero benefit. The headline you want is not "we use GenAI" — it is "we solved the problem better."

Analogy: An LLM is a brilliant, fast intern who occasionally makes things up with total confidence. Wonderful for first drafts, summaries, and judgment calls — but you would never let them run payroll arithmetic or wire money without a deterministic check and a second set of eyes. A calculator, by contrast, is boring and never wrong. Pick the intern for interpretation; pick the calculator for math.
Key takeaway: Start from the problem, not the technology. Reserve probabilistic AI for genuinely fuzzy, language-shaped tasks, and let deterministic code own anything with one correct, checkable answer.

How It Works: The Ladder of Escalation

The practical heart of this topic is a simple rule: climb only as high as the problem forces you. Anthropic's central guidance is to "find the simplest solution possible and only add complexity when it demonstrably improves outcomes." Picture the choices as rungs on a ladder. Each rung up buys you more capability but costs more money, adds more latency, and adds more ways to fail.

  THE LADDER OF ESCALATION  (climb only as high as you must)
  more capable, more cost, more latency, more failure surface
  ▲
  │  Rung 5  MULTI-AGENT     several agents, separable ownership
  │  Rung 4  AGENT           LLM decides its own steps & tools at runtime
  │  Rung 3  WORKFLOW        LLM steps on predefined, hardcoded paths
  │  Rung 2  RAG             single call + retrieved private/current facts
  │  Rung 1  SINGLE LLM CALL one good prompt (+ a few examples)
  │  Rung 0  NO AI           code, regex, SQL, rules, classical ML
  ▼
  cheaper, faster, more predictable, easier to debug

Here is the decision flow in order, from the research:

  1. Frame the problem, not the tech. Write down the user's job-to-be-done and how you will measure success before mentioning AI. If you cannot state the problem and the metric, you are not ready to pick a tool.
  2. Try the non-AI baseline. Can deterministic code, a regex, a database query, a lookup table, a rules engine, or a classical machine-learning model (like logistic regression on tabular data) solve it acceptably? If yes, stop. It is cheaper, faster, and debuggable.
  3. Decide if the task is genuinely AI-shaped. Use AI only if the work involves natural language, fuzzy judgment, generation, or pattern-matching that resists explicit rules. If the input-to-output mapping is exact and definable, it is not AI-shaped.
  4. Start at Rung 1: a single LLM call with a good prompt and a few in-context examples. Build evals immediately so you can tell whether it is good enough.
  5. Add RAG (Rung 2) only if the model needs current, private, or domain-specific facts it does not reliably know. RAG (Retrieval-Augmented Generation) fetches relevant documents at question time and feeds them into the prompt, so the answer is grounded in real data instead of the model's memory — and you can cite sources.
  6. Add a workflow (Rung 3) when the task has multiple distinct steps you can describe in advance.
  7. Escalate to an agent (Rung 4) only when the path cannot be hardcoded — the steps depend on intermediate results.
  8. Consider multi-agent (Rung 5) only when one agent's responsibilities genuinely do not fit in one loop. Most teams never need this.

Workflow versus agent — the sharpest line

Anthropic draws a clean distinction. A workflow is a system where LLM steps follow predefined code paths that your code controls — predictable, testable, cheaper, lower-latency. An agent is a system where the LLM dynamically decides how to accomplish the task, choosing its own sequence of tool calls over many turns with feedback from the environment. The test is simple: if you can describe the steps in advance, build a workflow; use an agent only when you genuinely cannot hardcode the path.

Common mistake: Jumping straight to an agent (or multi-agent) because the demo looked impressive, when a single call or a simple workflow would do. Agents add cost, latency, and compounding errors — small mistakes at each step accumulate across the chain. Three steps that are each 90% reliable yield only about 73% end-to-end accuracy (0.9 × 0.9 × 0.9). Every extra hop multiplies failure.
Key takeaway: Default to the lowest rung that could work and treat every step up — code → single call → RAG → workflow → agent — as something you must justify with a measured improvement in outcomes.

The Model Is Not the Product

A common trap is confusing the model for the product. An LLM is a component, not a finished thing.

Analogy: An LLM is an engine, not a car. By itself it is powerful but goes nowhere useful until you bolt on a chassis (workflow integration), steering (guardrails and policies), and a dashboard (evals and metrics). Shipping the raw model and calling it a product is shipping an engine on a stand.

A real productized AI system solves a defined business problem, integrates into existing workflows, respects policy and compliance, and delivers measurable outcomes. The hard 80% of the work is the surrounding evals, guardrails, data plumbing, and UX — not the model call. This is why getting a demo to roughly 60% quality is easy, but the journey from 60% to 95%+ is where almost all the effort lives. LinkedIn reportedly hit 80% quality on generative features in about one month, then needed four more months to pass 95%. Plan your roadmap around that last-mile curve, not the demo.

Analogy: Going from 60% to 95% is the last mile of delivery. The package crosses 99% of the distance easily on the highway; the final stretch to the doorstep — hallucinations, edge cases, latency, infinite input combinations — is where most of the cost and time hide.
Key takeaway: The model is one part inside a product. Budget the majority of your effort for the wrapper — evals, guardrails, integration, and UX — because that is where the last mile to reliable quality actually gets crossed.

Designing for Wrong Answers: Autonomy, Hallucination, and Trust

A durable engineering fact: hallucination is permanent. A model is a statistical "fill-in-the-blank" machine — it stores patterns about which words tend to follow which words, not a verified store of facts. RAG and lower randomness settings reduce the rate of confident wrongness but never reach zero. So the design conclusion is not "wait for a better model" — it is assume the model will be confidently wrong sometimes, and design the product so a wrong answer is cheap to catch and cheap to recover from.

Augmentation vs automation — decide per action

The core product decision is augmentation vs automation. Augmentation keeps the human central: the AI proposes, drafts, or surfaces, and the human decides, edits, or approves. Automation lets the AI act without per-action review. Crucially, this is not one setting for the whole product — you choose per action, scoring each on two axes: reversibility (can a wrong action be undone cheaply?) and blast radius (how much damage if wrong?).

                 HIGH blast radius
                        │
   draft + wait for     │   draft + wait for
   approval (HITL)      │   approval (HITL)
   e.g. send a refund   │   e.g. delete records,
                        │   charge a card
  ──────────────────────┼──────────────────────
   auto-run, human      │   auto-run silently
   monitors patterns    │   e.g. read a file,
   (on-the-loop)        │   classify a ticket
                        │
                 LOW blast radius
   irreversible ◄───────┴───────► reversible

A single agent mixes rungs: read a file = auto; write / delete / charge = require approval. The autonomy spectrum, from least to most autonomous, runs: (1) AI suggests, human does it; (2) AI drafts, human edits before sending; (3) AI acts but pauses for approval on risky actions (human-in-the-loop); (4) AI acts autonomously while a human watches aggregate behavior and can intervene (human-on-the-loop); (5) full automation.

A mature human-in-the-loop gate offers four decisions, not a yes/no: approve (run as-is), edit (change the arguments then run), reject (refuse with feedback so the agent can retry), and respond (the human's answer becomes the tool result). And you gate conditionally — pause only when the action's arguments cross a real risk threshold — so humans are not drowned in approvals they will rubber-stamp.

interrupt_on = {
  "read_record":   {"auto": True},
  "send_refund":   {"allowed": ["approve","edit","reject"],
                    "when": lambda req: req.args["amount"] > 50},
  "delete_records":{"allowed": ["approve","reject"]},
}
# a $20 refund auto-runs; a $2,000 refund pauses for a human

Pausing for a human requires checkpointing — durably saving the agent's full state, keyed to a conversation id, so when the human responds minutes or days later you resume mid-task instead of losing all in-progress work.

Trust calibration and the explanation trap

The goal is not to maximize trust but to calibrate it — get the user to rely on the AI exactly where it is reliable and use their own judgment where it is not. Two failure modes bracket the target: overtrust / automation bias (accepting wrong output uncritically) and undertrust (ignoring a correct answer and losing the AI's value). Automation bias is real and measured: even 80–90% accurate systems introduce new errors because users stop checking, and it worsens under time pressure and when the AI sounds authoritative.

Common mistake: Believing that adding an explanation cures overreliance. Replicated research shows that simply adding a plausible-looking explanation does not reliably reduce overreliance and can increase it — a confident rationale makes a wrong answer more persuasive. What actually works is a cognitive forcing function: a design that compels the user to think before accepting — make them commit to their own decision first and reveal the AI second, hide the suggestion behind a click, or insert a deliberate pause. The catch: these are the very designs users rate as harder and like least, so reserve them for high-stakes decisions.
Analogy: Trust calibration is like trusting a weather app — you have learned it is great about tomorrow and unreliable at ten days out, so you pack accordingly. Miscalibration is either carrying an umbrella every day because it once rained (undertrust) or planning an outdoor wedding on a ten-day forecast (overtrust).

UX for uncertainty and graceful failure

Make uncertainty legible at the point of decision: first-person hedging ("I'm not sure, but…"), categorical confidence (High / Medium / Low rather than fake-precise "87.3%"), clickable inline citations that invite verification, and surfacing disagreement when you run the model several times. Every AI feature also needs an explicit failure design: graceful degradation (fall back to a simpler tool, a cached result, or a read-only mode instead of crashing), a clean abstention path (let the model say "I don't know" and route onward), and human escalation ("Talk to a person") as an always-visible, first-class action. A blank or broken output reads as "the product is broken," and a fabricated answer is worse than an honest error.

Key takeaway: Treat wrong answers as a certainty, not an accident. Pick the autonomy rung per action from reversibility × blast radius, use forcing functions (not explanations) where stakes are high, and always provide degradation, abstention, and a human escape hatch.

Evaluation, Security, and Building for Change

If there is one root cause behind failed AI products, it is the absence of robust evals — systematic tests of whether the system meets its quality bar. And evals must be domain-specific: a generic "helpfulness" score tells you almost nothing about your app. The durable practice is error analysis: gather around 100 real traces, have a domain expert write free-text notes on what went wrong (open coding), cluster those notes into a failure taxonomy like "wrong tone" or "hallucinated field" (axial coding), and only then write narrow, mostly binary (pass/fail) evals for the failures you actually observed. Binary beats a 1–5 scale: it forces clearer judgments and needs smaller samples.

For the failures only a human could judge, you can use an LLM-as-a-judge (one LLM grading another's output) — but you must validate it against a human expert via critique shadowing (the expert's written critiques become few-shot examples in the judge prompt), and measure its true-positive and true-negative rates separately, not just raw agreement. One team reached over 90% agreement with their expert in three iterations this way.

Analogy: Evals are unit tests for a non-deterministic function. You cannot assert exact equality, so you assert properties — "contains the legal disclaimer," "never names a competitor," "the judge says pass" — and you write a test for each bug you actually saw, just as you add a regression test after a real incident.

Security: prompt injection and data leakage

Prompt injection is a structural limitation, not a patchable bug. An LLM reads trusted instructions and untrusted data in the same stream of tokens, so any text it ingests — a user message, a web page, a retrieved document, an email — can hijack it. It is the natural-language cousin of SQL injection, and OWASP ranks it the #1 LLM risk. There is no complete fix; you use defense-in-depth and least privilege: separate instructions from data, give each tool the minimum permissions (a summarizer agent should not have a "send email" or "delete" tool), filter inputs and outputs, and require human approval for high-risk actions. Privacy is a first-class risk too — models can memorize and regurgitate secrets, so mask PII on the way in, guardrail outputs on the way out, send the minimum data needed, and secure zero-retention / no-training terms for sensitive workloads.

Analogy: Prompt injection is SQL injection for natural language. Because instructions and data share one channel, untrusted text can take over — and the mitigations rhyme: isolate, least-privilege, validate, and never fully trust the input.

Build for model churn

Models get deprecated, repriced, and behave differently across versions. Treat model choice as a swappable dependency behind a model abstraction layer (an AI gateway) — a single internal interface so swapping providers is a config change, not a rewrite. Treat prompts as versioned production assets (in git or a registry, never inline strings) with a frozen regression testbed so you can diff old-versus-new outputs and catch regressions before they reach users.

class LLMGateway:
    def complete(self, messages, *, task):
        model = ROUTING[task]          # easy -> small, hard -> large
        return PROVIDERS[model.provider].call(model, messages)
# App code calls gateway.complete(...), never a vendor SDK directly,
# so changing models is config, not a rewrite.
Best practice: Make error analysis a recurring ritual — a full pass on ~100 fresh traces every few weeks during active development — and route the worst traces back into both your eval set and your prompt iteration. This feedback loop is the engine that moves you from 60% to 95%. Combine the rare explicit signals (thumbs up/down, under 1% of interactions) with abundant implicit ones (copy/paste, re-asks, abandonment).
Common mistake: Treating a 100% eval pass rate as success. It almost always means your evals are too easy, not that you are done. As Hamel Husain notes, "your pass rate is a product decision, depending on the failures you are willing to tolerate" — a meaningful suite around 70% is healthier than a trivial one at 100%.
Key takeaway: Evals built from real error analysis are the product moat; prompt injection and data leakage are structural risks you mitigate with least privilege and defense-in-depth; and an abstraction layer plus versioned prompts let you survive constant model churn.

Real-World Use Cases

ScenarioRight rung / patternWhy
Email validation, snake_case → camelCaseRung 0 — plain code / regexExact, one right answer; an LLM adds cost, latency, and a chance of error for pure downside.
Home-energy appliance schedulingRung 0 — greedy rule or linear programmingChip Huyen's example: a simple rule ("run laundry after 10pm off-peak") likely beats an LLM and is reliable; the team never tested AI against the baseline.
Routing support tickets to billing/technical/salesRung 1 — single LLM callMessy free-text classification; one prompt plus a few examples is plenty.
Support bot answering from your help docsRung 2 — RAGNeeds current, private facts; retrieval grounds answers and cites sources, updating when docs change with no retraining.
"Translate, then summarize, then format as JSON"Rung 3 — workflow (prompt chain)The three steps are known in advance, so hardcode the path for predictability.
"Debug why this server is failing using whatever logs help"Rung 4 — agentThe steps depend on what it finds; the path cannot be hardcoded. Needs sandboxing and guardrails.
Invoice-total extraction from PDFsAI + deterministic guardrailProbabilistic extraction wrapped in a deterministic check (line-items must sum to the total) with human review on mismatch.
Coding agents (Claude Code, Cursor, Copilot)Per-action HITLRead files freely (auto); propose-and-confirm before writing, running shell commands, or pushing.
Legal / financial researchSpecialized domain toolHigh precision, low error tolerance, citations required; a general LLM may fabricate cases — real lawyers have been sanctioned for citing AI-invented cases.
Key takeaway: The same product spans many rungs — easy parts get automated cheaply, risky parts get gated, and exact parts never touch the model at all. Match each task to its rung individually.

Section summary

  • Start from the problem and a success metric, never from the technology; the worst failure mode is a solution chasing a problem.
  • Always try the non-AI baseline first — deterministic code, regex, SQL, rules, or classical ML — and reserve probabilistic LLMs for genuinely fuzzy, language-shaped, generative tasks.
  • Climb the ladder (no AI → single call → RAG → workflow → agent → multi-agent) only as high as the problem forces; every rung adds cost, latency, and compounding-error risk you must justify.
  • Choose a workflow whenever you can describe the steps; use an agent only when the path must be decided at runtime.
  • The model is a component, not a product — most of the work and the hard 60%-to-95% last mile live in evals, guardrails, integration, and UX.
  • Hallucination is permanent: design per-action autonomy from reversibility × blast radius, use cognitive forcing functions (not explanations) for high-stakes decisions, and always provide graceful degradation, abstention, and a human escape hatch.
  • Build domain-specific evals from real error analysis, validate any LLM judge against a human, and treat a 100% pass rate as a warning that the tests are too easy.
  • Defend against prompt injection and data leakage with least privilege and defense-in-depth, and build for model churn with an abstraction layer and versioned, regression-tested prompts.

Continue reading