AI Product Judgment
Most failed AI products do not fail because the model was weak. They fail because someone started with the wrong question. They asked "what cool thing can we build with AI?" instead of "what problem are we solving, and what is the simplest thing that solves it?" This chapter is about that judgment — the discipline of deciding whether to use AI at all, and if so, how much. It is the most durable skill in AI engineering because it survives every model release, framework, and pricing change. A clever prompt trick for one model version is obsolete in six months; the judgment to know that a task did not need a model at all is forever.
An LLM (large language model — a system trained to predict the next word in text, which lets it understand and generate human language) is genuinely powerful. But power tempts misuse. The famous warning applies: "when you have a hammer, everything looks like a nail." Good product judgment is the ability to look at a problem and recognize which parts are nails (fuzzy, language-shaped) and which are screws (exact, rule-shaped) that need an entirely different tool.
What It Is and Why It Matters
AI product judgment is the discipline of matching the tool to the task. The core insight is that the world is full of deterministic problems — problems with one correct, checkable answer (validation, formatting, arithmetic, lookups, fixed business rules). Plain code solves these perfectly: same input, same output, every time, for near-zero cost, with no chance of making something up. LLMs are probabilistic — they produce a likely answer that can vary between runs and can be confidently wrong. As one practitioner puts it: "Business is deterministic by design; generative AI is probabilistic by nature."
So the first job is not building — it is sorting. Use AI for judgment, not arithmetic. LLMs shine on tasks that are ambiguous, open-ended, or require interpretation: summarizing a messy report, classifying support tickets written in free text, extracting structured fields from a sloppy PDF, answering questions in natural language. They are the wrong tool for tasks a rules engine, a SQL query, or a bit of math could do instantly and reliably. Calculating a customer's debt breakdown with a generative model has been described as "a grave architectural error" — it is arithmetic with one right answer, so use code.
Why this matters so much: every LLM call carries three hidden taxes. A cost penalty (you pay per token, and that multiplies across every user), a latency penalty (a network round-trip of 300 milliseconds to several seconds, where local code finishes in microseconds), and a reliability penalty (non-determinism and hallucination, producing bugs that are hard to reproduce). When you reach for AI on a deterministic task, you pay all three taxes for zero benefit. The headline you want is not "we use GenAI" — it is "we solved the problem better."
Analogy: An LLM is a brilliant, fast intern who occasionally makes things up with total confidence. Wonderful for first drafts, summaries, and judgment calls — but you would never let them run payroll arithmetic or wire money without a deterministic check and a second set of eyes. A calculator, by contrast, is boring and never wrong. Pick the intern for interpretation; pick the calculator for math.
How It Works: The Ladder of Escalation
The practical heart of this topic is a simple rule: climb only as high as the problem forces you. Anthropic's central guidance is to "find the simplest solution possible and only add complexity when it demonstrably improves outcomes." Picture the choices as rungs on a ladder. Each rung up buys you more capability but costs more money, adds more latency, and adds more ways to fail.
THE LADDER OF ESCALATION (climb only as high as you must) more capable, more cost, more latency, more failure surface ▲ │ Rung 5 MULTI-AGENT several agents, separable ownership │ Rung 4 AGENT LLM decides its own steps & tools at runtime │ Rung 3 WORKFLOW LLM steps on predefined, hardcoded paths │ Rung 2 RAG single call + retrieved private/current facts │ Rung 1 SINGLE LLM CALL one good prompt (+ a few examples) │ Rung 0 NO AI code, regex, SQL, rules, classical ML ▼ cheaper, faster, more predictable, easier to debug
Here is the decision flow in order, from the research:
- Frame the problem, not the tech. Write down the user's job-to-be-done and how you will measure success before mentioning AI. If you cannot state the problem and the metric, you are not ready to pick a tool.
- Try the non-AI baseline. Can deterministic code, a regex, a database query, a lookup table, a rules engine, or a classical machine-learning model (like logistic regression on tabular data) solve it acceptably? If yes, stop. It is cheaper, faster, and debuggable.
- Decide if the task is genuinely AI-shaped. Use AI only if the work involves natural language, fuzzy judgment, generation, or pattern-matching that resists explicit rules. If the input-to-output mapping is exact and definable, it is not AI-shaped.
- Start at Rung 1: a single LLM call with a good prompt and a few in-context examples. Build evals immediately so you can tell whether it is good enough.
- Add RAG (Rung 2) only if the model needs current, private, or domain-specific facts it does not reliably know. RAG (Retrieval-Augmented Generation) fetches relevant documents at question time and feeds them into the prompt, so the answer is grounded in real data instead of the model's memory — and you can cite sources.
- Add a workflow (Rung 3) when the task has multiple distinct steps you can describe in advance.
- Escalate to an agent (Rung 4) only when the path cannot be hardcoded — the steps depend on intermediate results.
- Consider multi-agent (Rung 5) only when one agent's responsibilities genuinely do not fit in one loop. Most teams never need this.
Workflow versus agent — the sharpest line
Anthropic draws a clean distinction. A workflow is a system where LLM steps follow predefined code paths that your code controls — predictable, testable, cheaper, lower-latency. An agent is a system where the LLM dynamically decides how to accomplish the task, choosing its own sequence of tool calls over many turns with feedback from the environment. The test is simple: if you can describe the steps in advance, build a workflow; use an agent only when you genuinely cannot hardcode the path.
The Model Is Not the Product
A common trap is confusing the model for the product. An LLM is a component, not a finished thing.
Analogy: An LLM is an engine, not a car. By itself it is powerful but goes nowhere useful until you bolt on a chassis (workflow integration), steering (guardrails and policies), and a dashboard (evals and metrics). Shipping the raw model and calling it a product is shipping an engine on a stand.
A real productized AI system solves a defined business problem, integrates into existing workflows, respects policy and compliance, and delivers measurable outcomes. The hard 80% of the work is the surrounding evals, guardrails, data plumbing, and UX — not the model call. This is why getting a demo to roughly 60% quality is easy, but the journey from 60% to 95%+ is where almost all the effort lives. LinkedIn reportedly hit 80% quality on generative features in about one month, then needed four more months to pass 95%. Plan your roadmap around that last-mile curve, not the demo.
Analogy: Going from 60% to 95% is the last mile of delivery. The package crosses 99% of the distance easily on the highway; the final stretch to the doorstep — hallucinations, edge cases, latency, infinite input combinations — is where most of the cost and time hide.
Designing for Wrong Answers: Autonomy, Hallucination, and Trust
A durable engineering fact: hallucination is permanent. A model is a statistical "fill-in-the-blank" machine — it stores patterns about which words tend to follow which words, not a verified store of facts. RAG and lower randomness settings reduce the rate of confident wrongness but never reach zero. So the design conclusion is not "wait for a better model" — it is assume the model will be confidently wrong sometimes, and design the product so a wrong answer is cheap to catch and cheap to recover from.
Augmentation vs automation — decide per action
The core product decision is augmentation vs automation. Augmentation keeps the human central: the AI proposes, drafts, or surfaces, and the human decides, edits, or approves. Automation lets the AI act without per-action review. Crucially, this is not one setting for the whole product — you choose per action, scoring each on two axes: reversibility (can a wrong action be undone cheaply?) and blast radius (how much damage if wrong?).
HIGH blast radius
│
draft + wait for │ draft + wait for
approval (HITL) │ approval (HITL)
e.g. send a refund │ e.g. delete records,
│ charge a card
──────────────────────┼──────────────────────
auto-run, human │ auto-run silently
monitors patterns │ e.g. read a file,
(on-the-loop) │ classify a ticket
│
LOW blast radius
irreversible ◄───────┴───────► reversible
A single agent mixes rungs: read a file = auto; write / delete / charge = require approval. The autonomy spectrum, from least to most autonomous, runs: (1) AI suggests, human does it; (2) AI drafts, human edits before sending; (3) AI acts but pauses for approval on risky actions (human-in-the-loop); (4) AI acts autonomously while a human watches aggregate behavior and can intervene (human-on-the-loop); (5) full automation.
A mature human-in-the-loop gate offers four decisions, not a yes/no: approve (run as-is), edit (change the arguments then run), reject (refuse with feedback so the agent can retry), and respond (the human's answer becomes the tool result). And you gate conditionally — pause only when the action's arguments cross a real risk threshold — so humans are not drowned in approvals they will rubber-stamp.
interrupt_on = {
"read_record": {"auto": True},
"send_refund": {"allowed": ["approve","edit","reject"],
"when": lambda req: req.args["amount"] > 50},
"delete_records":{"allowed": ["approve","reject"]},
}
# a $20 refund auto-runs; a $2,000 refund pauses for a human
Pausing for a human requires checkpointing — durably saving the agent's full state, keyed to a conversation id, so when the human responds minutes or days later you resume mid-task instead of losing all in-progress work.
Trust calibration and the explanation trap
The goal is not to maximize trust but to calibrate it — get the user to rely on the AI exactly where it is reliable and use their own judgment where it is not. Two failure modes bracket the target: overtrust / automation bias (accepting wrong output uncritically) and undertrust (ignoring a correct answer and losing the AI's value). Automation bias is real and measured: even 80–90% accurate systems introduce new errors because users stop checking, and it worsens under time pressure and when the AI sounds authoritative.
Analogy: Trust calibration is like trusting a weather app — you have learned it is great about tomorrow and unreliable at ten days out, so you pack accordingly. Miscalibration is either carrying an umbrella every day because it once rained (undertrust) or planning an outdoor wedding on a ten-day forecast (overtrust).
UX for uncertainty and graceful failure
Make uncertainty legible at the point of decision: first-person hedging ("I'm not sure, but…"), categorical confidence (High / Medium / Low rather than fake-precise "87.3%"), clickable inline citations that invite verification, and surfacing disagreement when you run the model several times. Every AI feature also needs an explicit failure design: graceful degradation (fall back to a simpler tool, a cached result, or a read-only mode instead of crashing), a clean abstention path (let the model say "I don't know" and route onward), and human escalation ("Talk to a person") as an always-visible, first-class action. A blank or broken output reads as "the product is broken," and a fabricated answer is worse than an honest error.
Evaluation, Security, and Building for Change
If there is one root cause behind failed AI products, it is the absence of robust evals — systematic tests of whether the system meets its quality bar. And evals must be domain-specific: a generic "helpfulness" score tells you almost nothing about your app. The durable practice is error analysis: gather around 100 real traces, have a domain expert write free-text notes on what went wrong (open coding), cluster those notes into a failure taxonomy like "wrong tone" or "hallucinated field" (axial coding), and only then write narrow, mostly binary (pass/fail) evals for the failures you actually observed. Binary beats a 1–5 scale: it forces clearer judgments and needs smaller samples.
For the failures only a human could judge, you can use an LLM-as-a-judge (one LLM grading another's output) — but you must validate it against a human expert via critique shadowing (the expert's written critiques become few-shot examples in the judge prompt), and measure its true-positive and true-negative rates separately, not just raw agreement. One team reached over 90% agreement with their expert in three iterations this way.
Analogy: Evals are unit tests for a non-deterministic function. You cannot assert exact equality, so you assert properties — "contains the legal disclaimer," "never names a competitor," "the judge says pass" — and you write a test for each bug you actually saw, just as you add a regression test after a real incident.
Security: prompt injection and data leakage
Prompt injection is a structural limitation, not a patchable bug. An LLM reads trusted instructions and untrusted data in the same stream of tokens, so any text it ingests — a user message, a web page, a retrieved document, an email — can hijack it. It is the natural-language cousin of SQL injection, and OWASP ranks it the #1 LLM risk. There is no complete fix; you use defense-in-depth and least privilege: separate instructions from data, give each tool the minimum permissions (a summarizer agent should not have a "send email" or "delete" tool), filter inputs and outputs, and require human approval for high-risk actions. Privacy is a first-class risk too — models can memorize and regurgitate secrets, so mask PII on the way in, guardrail outputs on the way out, send the minimum data needed, and secure zero-retention / no-training terms for sensitive workloads.
Analogy: Prompt injection is SQL injection for natural language. Because instructions and data share one channel, untrusted text can take over — and the mitigations rhyme: isolate, least-privilege, validate, and never fully trust the input.
Build for model churn
Models get deprecated, repriced, and behave differently across versions. Treat model choice as a swappable dependency behind a model abstraction layer (an AI gateway) — a single internal interface so swapping providers is a config change, not a rewrite. Treat prompts as versioned production assets (in git or a registry, never inline strings) with a frozen regression testbed so you can diff old-versus-new outputs and catch regressions before they reach users.
class LLMGateway:
def complete(self, messages, *, task):
model = ROUTING[task] # easy -> small, hard -> large
return PROVIDERS[model.provider].call(model, messages)
# App code calls gateway.complete(...), never a vendor SDK directly,
# so changing models is config, not a rewrite.
Real-World Use Cases
| Scenario | Right rung / pattern | Why |
|---|---|---|
| Email validation, snake_case → camelCase | Rung 0 — plain code / regex | Exact, one right answer; an LLM adds cost, latency, and a chance of error for pure downside. |
| Home-energy appliance scheduling | Rung 0 — greedy rule or linear programming | Chip Huyen's example: a simple rule ("run laundry after 10pm off-peak") likely beats an LLM and is reliable; the team never tested AI against the baseline. |
| Routing support tickets to billing/technical/sales | Rung 1 — single LLM call | Messy free-text classification; one prompt plus a few examples is plenty. |
| Support bot answering from your help docs | Rung 2 — RAG | Needs current, private facts; retrieval grounds answers and cites sources, updating when docs change with no retraining. |
| "Translate, then summarize, then format as JSON" | Rung 3 — workflow (prompt chain) | The three steps are known in advance, so hardcode the path for predictability. |
| "Debug why this server is failing using whatever logs help" | Rung 4 — agent | The steps depend on what it finds; the path cannot be hardcoded. Needs sandboxing and guardrails. |
| Invoice-total extraction from PDFs | AI + deterministic guardrail | Probabilistic extraction wrapped in a deterministic check (line-items must sum to the total) with human review on mismatch. |
| Coding agents (Claude Code, Cursor, Copilot) | Per-action HITL | Read files freely (auto); propose-and-confirm before writing, running shell commands, or pushing. |
| Legal / financial research | Specialized domain tool | High precision, low error tolerance, citations required; a general LLM may fabricate cases — real lawyers have been sanctioned for citing AI-invented cases. |
Section summary
- Start from the problem and a success metric, never from the technology; the worst failure mode is a solution chasing a problem.
- Always try the non-AI baseline first — deterministic code, regex, SQL, rules, or classical ML — and reserve probabilistic LLMs for genuinely fuzzy, language-shaped, generative tasks.
- Climb the ladder (no AI → single call → RAG → workflow → agent → multi-agent) only as high as the problem forces; every rung adds cost, latency, and compounding-error risk you must justify.
- Choose a workflow whenever you can describe the steps; use an agent only when the path must be decided at runtime.
- The model is a component, not a product — most of the work and the hard 60%-to-95% last mile live in evals, guardrails, integration, and UX.
- Hallucination is permanent: design per-action autonomy from reversibility × blast radius, use cognitive forcing functions (not explanations) for high-stakes decisions, and always provide graceful degradation, abstention, and a human escape hatch.
- Build domain-specific evals from real error analysis, validate any LLM judge against a human, and treat a 100% pass rate as a warning that the tests are too easy.
- Defend against prompt injection and data leakage with least privilege and defense-in-depth, and build for model churn with an abstraction layer and versioned, regression-tested prompts.