Reliability, Evaluation & Observability

By Pritesh Yadav 19 min read

How to keep long-running, multi-agent LLM systems from quietly going off the rails — and how to know when they do.

Research/reference doc · 2026-06-16 · part of the Agent Orchestration suite


1. Why agents break the old reliability model

A single prompt→response call is easy to reason about: one input, one output, fail or succeed. The moment an agent starts to plan, call tools, hold state, and talk to other agents, you inherit every classic distributed-systems failure class — race conditions, partial failures, inconsistent state, cascading errors, deadlock — and then add a probabilistic engine on top of it.

The long-horizon cliff. Frontier models hit near-100% success on tasks a human could do in under ~4 minutes, but success falls below 10% on tasks needing more than ~4 hours of human-equivalent work. As you push agents into longer-horizon work, time-compounding failure modes become the dominant source of cost and unreliability — not raw model IQ.

The compounding-error math. If each step is independently 95% reliable, a 20-step trajectory is only 0.95²⁰ ≈ 36% reliable. Reliability decays geometrically with horizon length:

steps:        1     5     10     20     40
@95%/step:   95%   77%   60%    36%    13%
@99%/step:   99%   95%   90%    82%    67%

Early mistakes rarely stay local — they “cascade into subsequent steps, distorting reasoning, compounding misjudgments, and ultimately derailing the entire trajectory.” Once a cascade starts it is hard to reverse, so early detection and correction matter far more than post-hoc repair.

Anthropic’s framing: “Minor system failures can be catastrophic for agents.” Because agents are long-running and stateful, a small fault deep into a run wastes everything done before it.

Two design consequences flow from this whole section, and the rest of the doc elaborates them:

  1. Cut horizon length and step count wherever you can (each step is a dice roll).
  2. Detect-and-correct mid-run beats restart-from-scratch (see §7 resumability).

2. Failure modes

2a. MAST — the empirical taxonomy

The anchor reference is “Why Do Multi-Agent LLM Systems Fail?” (arXiv 2503.13657), the first empirically grounded multi-agent failure taxonomy. It was built from 1,600+ annotated execution traces across 7 popular MAS frameworks, covering coding/math/general-agent tasks on GPT-4, Claude 3, Qwen 2.5, and CodeLlama, validated by 6 expert annotators with inter-annotator agreement Cohen’s κ = 0.88. It defines 14 failure modes in 3 categories.

The central, uncomfortable finding: many failures stem from organizational / system design, not raw model capability. You can’t model-upgrade your way out of a bad orchestration design.

Category-level share of failures:

CategoryShare
Specification & system design≈ 41.8%
Inter-agent misalignment≈ 36.9%
Task verification & termination≈ 21% (remainder)
CatFailure modeDefinition
FC1 Specification / System DesignFM-1.1 Disobey task specificationIgnores stated constraints / requirements
FM-1.2 Disobey role specificationActs outside its assigned role boundaries
FM-1.3 Step repetitionNeedlessly redoes completed steps → delay/error
FM-1.4 Loss of conversation historyContext truncated; reverts to an earlier state
FM-1.5 Unaware of termination conditionsDoesn’t know when to stop (→ loops)
FC2 Inter-Agent MisalignmentFM-2.1 Conversation resetUnwarranted restart losing context/progress
FM-2.2 Fail to ask for clarificationProceeds on ambiguous/unclear input
FM-2.3 Task derailmentDrifts toward irrelevant/unproductive actions
FM-2.4 Information withholdingFails to share insight peers needed
FM-2.5 Ignored other agent’s inputDisregards a collaborator’s recommendation
FM-2.6 Reasoning-action mismatchReasoning diverges from the executed action
FC3 Task Verification / TerminationFM-3.1 Premature terminationEnds before objectives are met
FM-3.2 No/incomplete verificationSkips checking outcomes → errors propagate
FM-3.3 Incorrect verificationValidation done, but inadequately/wrongly

2b. Cascading subagent failures

“A misaligned agent produces an output that the next agent accepts as authoritative, and a verification failure that should have caught the problem is never triggered.” Multi-agent systems amplify error across independent agents rather than averaging it out. Anthropic notes that asynchronous orchestration would add parallelism but introduces fresh hazards: “challenges in result coordination, state consistency, and error propagation across the subagents.”

2c. Infinite loops / non-termination

Maps to MAST FM-1.5 (no termination awareness) and FM-1.3 (step repetition). Agents reinterpret goals differently and ping-pong; missing termination conditions are a top specification-class defect. Mitigation: explicit max-steps / max-cost budgets, loop detection (repeated identical tool calls / states), and forced termination.

2d. Tool misuse (“functional hallucination”)

Misusing tools/APIs: wrong tool, invalid arguments, or an unsafe plan. The function-calling / ToolScan literature breaks it into three categories:

CodeErrorExample
IFNIncorrect Function NameCalls a function not in the available set (a hallucinated tool)
IANIncorrect Argument NameHallucinated parameter names
IAVIncorrect Argument ValueWrong values, omitted required args, malformed/invalid JSON

Security angle: an agent told never to read secrets can hallucinate a read_secret capability — and if such a function happens to exist in the environment, the call executes. Hallucinated tool names can become real privilege escalation. Mitigations: strict JSON-Schema / Pydantic tool schemas, validate args before execution, fail closed on invalid args, and write precise tool descriptions with examples and constraints.

2e. Hallucinated coordination

Agents fabricate that a handoff or communication occurred, or act on outputs another agent never produced. Overlaps FM-2.6 (reasoning-action mismatch) and FM-2.4/2.5 (communication breakdowns), plus “format mismatches between agents” and “context loss during handoffs.” Treat every inter-agent handoff as an untrusted boundary: validate the payload shape and provenance, don’t assume the upstream agent did what it claimed.

2f. Deadlock / synchronous bottlenecks

Anthropic’s lead agent runs subagents synchronously, blocking until each finishes. This “simplifies coordination, but creates bottlenecks”: the lead can’t steer subagents mid-flight, subagents can’t coordinate with each other, and one slow subagent stalls the whole system. This is the practical head-of-line-blocking shape of “deadlock” in current production multi-agent systems — usually not a true lock cycle, but a slow dependency freezing the pipeline.


3. Guardrails

Guardrails = technical controls, validation layers, and policy enforcement that constrain agent behavior within safety/compliance/operational boundaries — block harmful outputs, block adversarial inputs, and keep agents within authorized scope.

Layered model

┌──────────────────────────────────────────────────────────┐
│ INPUT VALIDATION    context filtering, access control,     │
│                     prompt-injection / jailbreak detection,│
│                     relevance + blocklist + safety classify│
├──────────────────────────────────────────────────────────┤
│ ACTION GUARDRAILS   per tool call:                         │
│  (every tool call)   • authorization check                 │
│                      • schema validation on arguments      │
│                      • rate limit                          │
│                      • cost meter / budget ceiling         │
│                      • kill switch                         │
├──────────────────────────────────────────────────────────┤
│ OUTPUT VALIDATION   JSON-schema compliance BEFORE anything │
│                     downstream consumes the output         │
├──────────────────────────────────────────────────────────┤
│ PERMISSION GATING   least privilege per tool;              │
│                     act only within authorized scope       │
├──────────────────────────────────────────────────────────┤
│ HUMAN-IN-THE-LOOP   pause the run for approval when risk   │
│                     exceeds an automation threshold        │
└──────────────────────────────────────────────────────────┘
  • Input validation — block requests carrying sensitive data, detect prompt injection/jailbreaks, prevent connecting to unauthorized APIs; relevance/keyword/blocklist filters and safety classifiers.
  • Action-level guardrails — wrap every tool call with authorization, argument schema validation, rate limit, cost meter, and a kill switch. Whitelist API calls; put budget ceilings on LLM spend so a misconfigured agent can’t spiral.
  • Output validation — for structured outputs feeding downstream APIs, enforce JSON-schema compliance and validate before anything consumes it.
  • Human-in-the-loop checkpoints — inject human judgment when risk exceeds automation thresholds; require human approval for production changes and high-stakes actions.

Vendor guidance

  • OpenAI (A Practical Guide to Building Agents + Agents SDK): guardrails are functions or agents enforcing jailbreak prevention, relevance, keyword filtering, blocklists, and safety classification — “critical at every stage, from input filtering and tool use to human-in-the-loop intervention.” Human review pauses the run so a person/policy approves or rejects a sensitive action; this is especially important early in deployment to surface failures and edge cases and bootstrap the eval cycle. OpenAI’s Guardrails library ships an eval framework reporting precision, recall, F1 against labeled test data.
  • LangGraph has native tool scoping and human-in-the-loop primitives (interrupt / approve). Proxy layers (Invariant Gateway, agentgateway) centralize cross-agent policy enforcement.

PF360 house-rule tie-in (CLAUDE.md §0). If a guardrail or integration isn’t configured, never show a broken/empty/silent-failure state. Hide or disable the control, explain in plain language what’s missing, and link to where to set it up — e.g. “No safety classifier configured. Set one up → Settings › Guardrails.”


4. Evaluation

4a. Why exact-match fails for agents

Agents don’t follow predetermined paths. Anthropic: “Even with identical starting points, agents might take completely different valid paths to reach their goal.” Step-by-step exact validation is therefore impossible. Concretely: if the gold trajectory calls search_web but the agent calls search_documentation and both legitimately solve the task, exact match wrongly labels a success as a failure. Free-form text outputs with multiple valid answers also resist string equality. You need fuzzy / semantic matching.

4b. Outcome evals vs trajectory evals

Outcome / end-state evalTrajectory eval
JudgesThe final result — “did the task succeed?”The whole execution path: tools chosen, intermediate reasoning, turns
StrengthTolerates many valid paths; Anthropic leans here for research agents and judges outcomes rather than processesCatches failures masked by a correct-looking answer (e.g. skipped identity verification before a refund)
WeaknessAn agent “can call every tool correctly and still fail the task,” and a correct answer can hide broken reasoning / lucky-but-wrong tool callsBrittle if matched too strictly; needs a reference trajectory

Trajectory matching modes:

  • Exact match — tool calls + reasoning must match the reference exactly (strict, brittle).
  • In-order — required tools called in correct sequence; intermediate detail may vary.
  • Any-order — all required tools called, sequence ignored.

Three eval levels to run together:

end-to-end   →  did the task succeed?            (outcome)
trajectory   →  was the path efficient & sound?  (process)
component    →  which retriever/tool/subagent broke? (isolation)

Use deterministic checks for exact things (tool-name correctness, output format) and LLM-as-judge for anything depending on free-form output.

4c. LLM-as-judge

  • Anthropic’s implementation: a single LLM call scoring output against a rubric on factual accuracy, citation accuracy, completeness, source quality, tool efficiency — producing a 0.0–1.0 score plus a pass/fail grade. It scaled to “hundreds of outputs” and “aligned well with human judgment.”
  • Caveats from the literature: no judge excels everywhere. Mitigate with structured rubrics that constrain output, multiple judge passes with aggregation, calibration against human-labeled examples, and monitoring for consistency drift over time. Production studies of multi-turn transaction agents still find judges miss a meaningful fraction of failures (the “catching ~one in five” blind-spot framing). Judges are necessary but not sufficient.
  • Agent-as-judge is the emerging extension: use an agent to evaluate another agent’s full trajectory, not just its end result.

4d. Small, high-signal eval sets

Anthropic started with ~20 queries representing real usage. Early development moved success from 30% → 80% — an effect so large that a tiny set was enough to detect it. Start evaluating immediately; don’t wait for a large suite, and resist over-engineering eval infrastructure before you have any signal.

4e. Human evaluation remains essential

Automation misses subtle biases and edge cases. Anthropic’s concrete example: human testers caught that early agents “consistently chose SEO-optimized content farms over authoritative but less highly-ranked sources” — a quality bias no automated metric flagged. Wire a production feedback loop: failures → annotation queue → corrected trajectories → regression tests, so the same bug can’t reach users twice.


5. Observability & Tracing

5a. Why agent observability ≠ LLM monitoring

“Agent failures appear in multi-step causal chains, not at the individual-call level, and require full-session trace capture to detect.” Non-deterministic run-to-run behavior makes debugging hard; without tracing you can’t tell a bad search query from poor source selection from a tool failure. Anthropic uses full production tracing and monitors “decision patterns and interaction structures” while deliberately not reading individual conversation content — privacy-preserving root-cause analysis.

5b. OpenTelemetry GenAI semantic conventions (the emerging standard)

Driven by OTel’s GenAI SIG since April 2024, these standardize attribute names/types/enums for LLM calls, agent steps, vector-DB queries, token usage, cost, and quality. Status: most GenAI conventions are still experimental (not API-stable) as of early-mid 2026 — adopt them, but pin versions and expect churn.

Span tree:

invoke_agent                       (top-level agent run)
├── chat                           (one per LLM call)
│   └── execute_tool               (one per tool call)
├── chat
└── embeddings

Agent conventions extend/override the base GenAI span conventions and add concepts for tasks, actions, agents, teams, artifacts, and memory.

Key gen_ai.* attributes:

AttributeMeaning
gen_ai.operation.namechat, invoke_agent, execute_tool, embeddings
gen_ai.request.modelmodel requested (e.g. gpt-4o)
gen_ai.usage.input_tokens / gen_ai.usage.output_tokenstoken counts
gen_ai.response.finish_reasonswhy generation stopped (stop, tool_calls, …)
gen_ai.agent.idagent identifier
gen_ai.tool.nametool invoked
gen_ai.token.typemetric dimension distinguishing input vs output
gen_ai.system_instructions / gen_ai.input.messages / gen_ai.output.messagesprompt/completion content (only when content capture is enabled)

Metrics: gen_ai.client.operation.duration (latency histogram), gen_ai.client.token.usage (token histogram).

Microsoft + Cisco Outshift introduced multi-agent semantic conventions on top of OTel + W3C Trace Context, standardizing telemetry for quality, performance, safety, cost, tool invocations, and inter-agent collaboration. Microsoft Foundry / Azure Monitor App Insights trace LangChain, LangGraph, OpenAI Agents SDK, and Microsoft Agent Framework. Datadog natively supports the OTel GenAI conventions.

5c. Tooling landscape

ToolStrengthOSS / self-hostNotes
LangSmithLangChain/LangGraph-native: node-by-node state diffs, full execution graphs, model+tool breakdowns, replay against new model versionsNo (managed)Least friction on LangChain/LangGraph; 5,000 traces/mo free
LangfuseOperational telemetry + cost analytics; framework-agnostic via OTel; ingests any LLM SDK/agent frameworkYes (Postgres + ClickHouse)Leading OSS observability path
Arize PhoenixEnterprise RAG eval & monitoring; agent trace capture, LLM-as-judge eval, RAG metrics, dataset mgmt; OpenTelemetry-nativeYesOSS eval + observability
W&B WeaveExperimentation & prompt researchPartialBest for experiment tracking

Pricing gotcha worth a callout. Billing per trace penalizes agent complexity: a 50-step autonomous agent costs the same as one LLM call if billed per span, but 50× more if billed per trace. Factor the billing unit into your architecture and vendor choice.


6. Testing & determinism challenges

LLMs are inherently probabilistic, and this defeats naive regression testing.

  • Even at temperature 0, output is not guaranteed identical run-to-run — numerical-precision effects under greedy decoding still cause drift. OpenAI states its API can only be “mostly deterministic” regardless of temperature; the seed parameter improves but does not guarantee reproducibility, because of system updates and load-balancing across different hardware (Thinking Machines Lab traces the root cause to numerical nondeterminism in inference).
  • The same agent config (prompt, tools, model, orchestration) can yield different outputs across invocations due to temperature sampling, model weight/version updates, tool-latency variation, and context-window effects. This makes reproducing a specific failure extraordinarily difficult.

Regression strategy for non-deterministic agents:

  1. Run a replay regression suite as a CI gate first — “fast, cheap, and deterministic” (replay journaled tool/LLM results instead of re-calling live).
  2. Aggregate over many sessions before comparing A/B arms. A single session can diverge within an arm; compare distributions, not point samples. Use the same seed policy across both arms.
  3. Use deterministic assertions for exact things (tool-name correctness, schema/format) and LLM-judge for free-form quality.

Mantra: “Temperature defines how wild the model can be; seeds make that wildness replayable” — but seeds are a partial, not complete, solution.


7. Resumability, checkpointing & durable execution

The compounding-error math (§1) means restarting a failed long run is wasteful. The fix is to persist progress and resume from the failure point.

Anthropic’s production approach (multi-agent research system)

  • Don’t restart on failure — it’s “expensive and frustrating.” Resume from where the agent was when the errors occurred.
  • Combine AI adaptability with deterministic safeguards: retry logic + regular checkpoints, so agents recover gracefully when tools fail.
  • Context-window management for hundreds-of-turns runs: agents summarize completed phases, offload essential info to external memory, then spawn fresh subagents with clean contexts, handing off to maintain continuity before hitting limits.
  • Rainbow deployments: because the system runs almost continuously as a stateful web of prompts/tools, updates gradually shift traffic old→new while running both simultaneously, so in-flight agents aren’t disrupted.

Durable execution

Durable execution = save progress at key points so a process can pause and resume exactly where it left off; state is persisted after every logical step, so a crash resumes from the last checkpoint, not the start.

FrameworkHow it resumesWatch out for
LangGraphPersistence layer saves graph state as checkpoints organized into threads; a StateSnapshot at every super-step; thread_id is the primary key. Interrupt pauses, saves, waits indefinitely (enables human-in-the-loop), then resumes.Checkpointers save state between nodes, not inside a node; the developer must detect/trigger/coordinate resumption.
TemporalReplays an Event History (durable log) to reconstruct in-memory state after a crash; resumes at the exact failure step without re-running completed work.Workflow code must be deterministic. Because LLM calls are non-deterministic, wrap them as “activity” steps whose results are journaled on first execution and never re-run on replay — the key bridge between durable execution and probabilistic LLMs.
Restate / DBOSJournals + stored step results. OpenAI’s Agents SDK docs point to DBOS for long-running agents, human-in-the-loop flows, and preserved progress across failures/restarts.Same determinism discipline at the step boundary.

Caveat (Diagrid): plain checkpointing ≠ full durable execution. Checkpoints give you a save point but leave detection, coordination, and dedup to the developer. Production-grade fault tolerance usually needs a true durable-execution runtime, not just snapshots.


8. Headline numbers worth citing (Anthropic multi-agent system)

ClaimNumber
Multi-agent (Opus 4 lead + Sonnet 4 subagents) vs single-agent Opus 4 on internal research evals+90.2% (esp. breadth-first / parallel queries)
Share of performance variance explained by token usage alone on BrowseComp~80% (model choice + tool-call count add ~15%)
Token cost of multi-agent vs chat~15× (and ~4× a standard chat turn) → reserve for high-value tasks
Research-time reduction from parallel tool calling on complex queriesup to 90% (minutes vs hours)

The takeaway: multi-agent buys large quality gains on parallelizable, breadth-first work, but at a steep token cost — which is exactly why the reliability, eval, and observability discipline in this doc matters. You’re spending 15× the tokens; you’d better be able to tell whether it worked.


Sources

Continue reading