Reliability, Evaluation & Observability
How to keep long-running, multi-agent LLM systems from quietly going off the rails — and how to know when they do.
Research/reference doc · 2026-06-16 · part of the Agent Orchestration suite
1. Why agents break the old reliability model
A single prompt→response call is easy to reason about: one input, one output, fail or succeed. The moment an agent starts to plan, call tools, hold state, and talk to other agents, you inherit every classic distributed-systems failure class — race conditions, partial failures, inconsistent state, cascading errors, deadlock — and then add a probabilistic engine on top of it.
The long-horizon cliff. Frontier models hit near-100% success on tasks a human could do in under ~4 minutes, but success falls below 10% on tasks needing more than ~4 hours of human-equivalent work. As you push agents into longer-horizon work, time-compounding failure modes become the dominant source of cost and unreliability — not raw model IQ.
The compounding-error math. If each step is independently 95% reliable, a 20-step trajectory is only 0.95²⁰ ≈ 36% reliable. Reliability decays geometrically with horizon length:
steps: 1 5 10 20 40
@95%/step: 95% 77% 60% 36% 13%
@99%/step: 99% 95% 90% 82% 67%
Early mistakes rarely stay local — they “cascade into subsequent steps, distorting reasoning, compounding misjudgments, and ultimately derailing the entire trajectory.” Once a cascade starts it is hard to reverse, so early detection and correction matter far more than post-hoc repair.
Anthropic’s framing: “Minor system failures can be catastrophic for agents.” Because agents are long-running and stateful, a small fault deep into a run wastes everything done before it.
Two design consequences flow from this whole section, and the rest of the doc elaborates them:
- Cut horizon length and step count wherever you can (each step is a dice roll).
- Detect-and-correct mid-run beats restart-from-scratch (see §7 resumability).
2. Failure modes
2a. MAST — the empirical taxonomy
The anchor reference is “Why Do Multi-Agent LLM Systems Fail?” (arXiv 2503.13657), the first empirically grounded multi-agent failure taxonomy. It was built from 1,600+ annotated execution traces across 7 popular MAS frameworks, covering coding/math/general-agent tasks on GPT-4, Claude 3, Qwen 2.5, and CodeLlama, validated by 6 expert annotators with inter-annotator agreement Cohen’s κ = 0.88. It defines 14 failure modes in 3 categories.
The central, uncomfortable finding: many failures stem from organizational / system design, not raw model capability. You can’t model-upgrade your way out of a bad orchestration design.
Category-level share of failures:
| Category | Share |
|---|---|
| Specification & system design | ≈ 41.8% |
| Inter-agent misalignment | ≈ 36.9% |
| Task verification & termination | ≈ 21% (remainder) |
| Cat | Failure mode | Definition |
|---|---|---|
| FC1 Specification / System Design | FM-1.1 Disobey task specification | Ignores stated constraints / requirements |
| FM-1.2 Disobey role specification | Acts outside its assigned role boundaries | |
| FM-1.3 Step repetition | Needlessly redoes completed steps → delay/error | |
| FM-1.4 Loss of conversation history | Context truncated; reverts to an earlier state | |
| FM-1.5 Unaware of termination conditions | Doesn’t know when to stop (→ loops) | |
| FC2 Inter-Agent Misalignment | FM-2.1 Conversation reset | Unwarranted restart losing context/progress |
| FM-2.2 Fail to ask for clarification | Proceeds on ambiguous/unclear input | |
| FM-2.3 Task derailment | Drifts toward irrelevant/unproductive actions | |
| FM-2.4 Information withholding | Fails to share insight peers needed | |
| FM-2.5 Ignored other agent’s input | Disregards a collaborator’s recommendation | |
| FM-2.6 Reasoning-action mismatch | Reasoning diverges from the executed action | |
| FC3 Task Verification / Termination | FM-3.1 Premature termination | Ends before objectives are met |
| FM-3.2 No/incomplete verification | Skips checking outcomes → errors propagate | |
| FM-3.3 Incorrect verification | Validation done, but inadequately/wrongly |
2b. Cascading subagent failures
“A misaligned agent produces an output that the next agent accepts as authoritative, and a verification failure that should have caught the problem is never triggered.” Multi-agent systems amplify error across independent agents rather than averaging it out. Anthropic notes that asynchronous orchestration would add parallelism but introduces fresh hazards: “challenges in result coordination, state consistency, and error propagation across the subagents.”
2c. Infinite loops / non-termination
Maps to MAST FM-1.5 (no termination awareness) and FM-1.3 (step repetition). Agents reinterpret goals differently and ping-pong; missing termination conditions are a top specification-class defect. Mitigation: explicit max-steps / max-cost budgets, loop detection (repeated identical tool calls / states), and forced termination.
2d. Tool misuse (“functional hallucination”)
Misusing tools/APIs: wrong tool, invalid arguments, or an unsafe plan. The function-calling / ToolScan literature breaks it into three categories:
| Code | Error | Example |
|---|---|---|
| IFN | Incorrect Function Name | Calls a function not in the available set (a hallucinated tool) |
| IAN | Incorrect Argument Name | Hallucinated parameter names |
| IAV | Incorrect Argument Value | Wrong values, omitted required args, malformed/invalid JSON |
Security angle: an agent told never to read secrets can hallucinate a read_secret capability — and if such a function happens to exist in the environment, the call executes. Hallucinated tool names can become real privilege escalation. Mitigations: strict JSON-Schema / Pydantic tool schemas, validate args before execution, fail closed on invalid args, and write precise tool descriptions with examples and constraints.
2e. Hallucinated coordination
Agents fabricate that a handoff or communication occurred, or act on outputs another agent never produced. Overlaps FM-2.6 (reasoning-action mismatch) and FM-2.4/2.5 (communication breakdowns), plus “format mismatches between agents” and “context loss during handoffs.” Treat every inter-agent handoff as an untrusted boundary: validate the payload shape and provenance, don’t assume the upstream agent did what it claimed.
2f. Deadlock / synchronous bottlenecks
Anthropic’s lead agent runs subagents synchronously, blocking until each finishes. This “simplifies coordination, but creates bottlenecks”: the lead can’t steer subagents mid-flight, subagents can’t coordinate with each other, and one slow subagent stalls the whole system. This is the practical head-of-line-blocking shape of “deadlock” in current production multi-agent systems — usually not a true lock cycle, but a slow dependency freezing the pipeline.
3. Guardrails
Guardrails = technical controls, validation layers, and policy enforcement that constrain agent behavior within safety/compliance/operational boundaries — block harmful outputs, block adversarial inputs, and keep agents within authorized scope.
Layered model
┌──────────────────────────────────────────────────────────┐
│ INPUT VALIDATION context filtering, access control, │
│ prompt-injection / jailbreak detection,│
│ relevance + blocklist + safety classify│
├──────────────────────────────────────────────────────────┤
│ ACTION GUARDRAILS per tool call: │
│ (every tool call) • authorization check │
│ • schema validation on arguments │
│ • rate limit │
│ • cost meter / budget ceiling │
│ • kill switch │
├──────────────────────────────────────────────────────────┤
│ OUTPUT VALIDATION JSON-schema compliance BEFORE anything │
│ downstream consumes the output │
├──────────────────────────────────────────────────────────┤
│ PERMISSION GATING least privilege per tool; │
│ act only within authorized scope │
├──────────────────────────────────────────────────────────┤
│ HUMAN-IN-THE-LOOP pause the run for approval when risk │
│ exceeds an automation threshold │
└──────────────────────────────────────────────────────────┘
- Input validation — block requests carrying sensitive data, detect prompt injection/jailbreaks, prevent connecting to unauthorized APIs; relevance/keyword/blocklist filters and safety classifiers.
- Action-level guardrails — wrap every tool call with authorization, argument schema validation, rate limit, cost meter, and a kill switch. Whitelist API calls; put budget ceilings on LLM spend so a misconfigured agent can’t spiral.
- Output validation — for structured outputs feeding downstream APIs, enforce JSON-schema compliance and validate before anything consumes it.
- Human-in-the-loop checkpoints — inject human judgment when risk exceeds automation thresholds; require human approval for production changes and high-stakes actions.
Vendor guidance
- OpenAI (A Practical Guide to Building Agents + Agents SDK): guardrails are functions or agents enforcing jailbreak prevention, relevance, keyword filtering, blocklists, and safety classification — “critical at every stage, from input filtering and tool use to human-in-the-loop intervention.” Human review pauses the run so a person/policy approves or rejects a sensitive action; this is especially important early in deployment to surface failures and edge cases and bootstrap the eval cycle. OpenAI’s Guardrails library ships an eval framework reporting precision, recall, F1 against labeled test data.
- LangGraph has native tool scoping and human-in-the-loop primitives (interrupt / approve). Proxy layers (Invariant Gateway, agentgateway) centralize cross-agent policy enforcement.
PF360 house-rule tie-in (CLAUDE.md §0). If a guardrail or integration isn’t configured, never show a broken/empty/silent-failure state. Hide or disable the control, explain in plain language what’s missing, and link to where to set it up — e.g. “No safety classifier configured. Set one up → Settings › Guardrails.”
4. Evaluation
4a. Why exact-match fails for agents
Agents don’t follow predetermined paths. Anthropic: “Even with identical starting points, agents might take completely different valid paths to reach their goal.” Step-by-step exact validation is therefore impossible. Concretely: if the gold trajectory calls search_web but the agent calls search_documentation and both legitimately solve the task, exact match wrongly labels a success as a failure. Free-form text outputs with multiple valid answers also resist string equality. You need fuzzy / semantic matching.
4b. Outcome evals vs trajectory evals
| Outcome / end-state eval | Trajectory eval | |
|---|---|---|
| Judges | The final result — “did the task succeed?” | The whole execution path: tools chosen, intermediate reasoning, turns |
| Strength | Tolerates many valid paths; Anthropic leans here for research agents and judges outcomes rather than processes | Catches failures masked by a correct-looking answer (e.g. skipped identity verification before a refund) |
| Weakness | An agent “can call every tool correctly and still fail the task,” and a correct answer can hide broken reasoning / lucky-but-wrong tool calls | Brittle if matched too strictly; needs a reference trajectory |
Trajectory matching modes:
- Exact match — tool calls + reasoning must match the reference exactly (strict, brittle).
- In-order — required tools called in correct sequence; intermediate detail may vary.
- Any-order — all required tools called, sequence ignored.
Three eval levels to run together:
end-to-end → did the task succeed? (outcome)
trajectory → was the path efficient & sound? (process)
component → which retriever/tool/subagent broke? (isolation)
Use deterministic checks for exact things (tool-name correctness, output format) and LLM-as-judge for anything depending on free-form output.
4c. LLM-as-judge
- Anthropic’s implementation: a single LLM call scoring output against a rubric on factual accuracy, citation accuracy, completeness, source quality, tool efficiency — producing a 0.0–1.0 score plus a pass/fail grade. It scaled to “hundreds of outputs” and “aligned well with human judgment.”
- Caveats from the literature: no judge excels everywhere. Mitigate with structured rubrics that constrain output, multiple judge passes with aggregation, calibration against human-labeled examples, and monitoring for consistency drift over time. Production studies of multi-turn transaction agents still find judges miss a meaningful fraction of failures (the “catching ~one in five” blind-spot framing). Judges are necessary but not sufficient.
- Agent-as-judge is the emerging extension: use an agent to evaluate another agent’s full trajectory, not just its end result.
4d. Small, high-signal eval sets
Anthropic started with ~20 queries representing real usage. Early development moved success from 30% → 80% — an effect so large that a tiny set was enough to detect it. Start evaluating immediately; don’t wait for a large suite, and resist over-engineering eval infrastructure before you have any signal.
4e. Human evaluation remains essential
Automation misses subtle biases and edge cases. Anthropic’s concrete example: human testers caught that early agents “consistently chose SEO-optimized content farms over authoritative but less highly-ranked sources” — a quality bias no automated metric flagged. Wire a production feedback loop: failures → annotation queue → corrected trajectories → regression tests, so the same bug can’t reach users twice.
5. Observability & Tracing
5a. Why agent observability ≠ LLM monitoring
“Agent failures appear in multi-step causal chains, not at the individual-call level, and require full-session trace capture to detect.” Non-deterministic run-to-run behavior makes debugging hard; without tracing you can’t tell a bad search query from poor source selection from a tool failure. Anthropic uses full production tracing and monitors “decision patterns and interaction structures” while deliberately not reading individual conversation content — privacy-preserving root-cause analysis.
5b. OpenTelemetry GenAI semantic conventions (the emerging standard)
Driven by OTel’s GenAI SIG since April 2024, these standardize attribute names/types/enums for LLM calls, agent steps, vector-DB queries, token usage, cost, and quality. Status: most GenAI conventions are still experimental (not API-stable) as of early-mid 2026 — adopt them, but pin versions and expect churn.
Span tree:
invoke_agent (top-level agent run)
├── chat (one per LLM call)
│ └── execute_tool (one per tool call)
├── chat
└── embeddings
Agent conventions extend/override the base GenAI span conventions and add concepts for tasks, actions, agents, teams, artifacts, and memory.
Key gen_ai.* attributes:
| Attribute | Meaning |
|---|---|
gen_ai.operation.name | chat, invoke_agent, execute_tool, embeddings |
gen_ai.request.model | model requested (e.g. gpt-4o) |
gen_ai.usage.input_tokens / gen_ai.usage.output_tokens | token counts |
gen_ai.response.finish_reasons | why generation stopped (stop, tool_calls, …) |
gen_ai.agent.id | agent identifier |
gen_ai.tool.name | tool invoked |
gen_ai.token.type | metric dimension distinguishing input vs output |
gen_ai.system_instructions / gen_ai.input.messages / gen_ai.output.messages | prompt/completion content (only when content capture is enabled) |
Metrics: gen_ai.client.operation.duration (latency histogram), gen_ai.client.token.usage (token histogram).
Microsoft + Cisco Outshift introduced multi-agent semantic conventions on top of OTel + W3C Trace Context, standardizing telemetry for quality, performance, safety, cost, tool invocations, and inter-agent collaboration. Microsoft Foundry / Azure Monitor App Insights trace LangChain, LangGraph, OpenAI Agents SDK, and Microsoft Agent Framework. Datadog natively supports the OTel GenAI conventions.
5c. Tooling landscape
| Tool | Strength | OSS / self-host | Notes |
|---|---|---|---|
| LangSmith | LangChain/LangGraph-native: node-by-node state diffs, full execution graphs, model+tool breakdowns, replay against new model versions | No (managed) | Least friction on LangChain/LangGraph; 5,000 traces/mo free |
| Langfuse | Operational telemetry + cost analytics; framework-agnostic via OTel; ingests any LLM SDK/agent framework | Yes (Postgres + ClickHouse) | Leading OSS observability path |
| Arize Phoenix | Enterprise RAG eval & monitoring; agent trace capture, LLM-as-judge eval, RAG metrics, dataset mgmt; OpenTelemetry-native | Yes | OSS eval + observability |
| W&B Weave | Experimentation & prompt research | Partial | Best for experiment tracking |
Pricing gotcha worth a callout. Billing per trace penalizes agent complexity: a 50-step autonomous agent costs the same as one LLM call if billed per span, but 50× more if billed per trace. Factor the billing unit into your architecture and vendor choice.
6. Testing & determinism challenges
LLMs are inherently probabilistic, and this defeats naive regression testing.
- Even at temperature 0, output is not guaranteed identical run-to-run — numerical-precision effects under greedy decoding still cause drift. OpenAI states its API can only be “mostly deterministic” regardless of temperature; the
seedparameter improves but does not guarantee reproducibility, because of system updates and load-balancing across different hardware (Thinking Machines Lab traces the root cause to numerical nondeterminism in inference). - The same agent config (prompt, tools, model, orchestration) can yield different outputs across invocations due to temperature sampling, model weight/version updates, tool-latency variation, and context-window effects. This makes reproducing a specific failure extraordinarily difficult.
Regression strategy for non-deterministic agents:
- Run a replay regression suite as a CI gate first — “fast, cheap, and deterministic” (replay journaled tool/LLM results instead of re-calling live).
- Aggregate over many sessions before comparing A/B arms. A single session can diverge within an arm; compare distributions, not point samples. Use the same seed policy across both arms.
- Use deterministic assertions for exact things (tool-name correctness, schema/format) and LLM-judge for free-form quality.
Mantra: “Temperature defines how wild the model can be; seeds make that wildness replayable” — but seeds are a partial, not complete, solution.
7. Resumability, checkpointing & durable execution
The compounding-error math (§1) means restarting a failed long run is wasteful. The fix is to persist progress and resume from the failure point.
Anthropic’s production approach (multi-agent research system)
- Don’t restart on failure — it’s “expensive and frustrating.” Resume from where the agent was when the errors occurred.
- Combine AI adaptability with deterministic safeguards: retry logic + regular checkpoints, so agents recover gracefully when tools fail.
- Context-window management for hundreds-of-turns runs: agents summarize completed phases, offload essential info to external memory, then spawn fresh subagents with clean contexts, handing off to maintain continuity before hitting limits.
- Rainbow deployments: because the system runs almost continuously as a stateful web of prompts/tools, updates gradually shift traffic old→new while running both simultaneously, so in-flight agents aren’t disrupted.
Durable execution
Durable execution = save progress at key points so a process can pause and resume exactly where it left off; state is persisted after every logical step, so a crash resumes from the last checkpoint, not the start.
| Framework | How it resumes | Watch out for |
|---|---|---|
| LangGraph | Persistence layer saves graph state as checkpoints organized into threads; a StateSnapshot at every super-step; thread_id is the primary key. Interrupt pauses, saves, waits indefinitely (enables human-in-the-loop), then resumes. | Checkpointers save state between nodes, not inside a node; the developer must detect/trigger/coordinate resumption. |
| Temporal | Replays an Event History (durable log) to reconstruct in-memory state after a crash; resumes at the exact failure step without re-running completed work. | Workflow code must be deterministic. Because LLM calls are non-deterministic, wrap them as “activity” steps whose results are journaled on first execution and never re-run on replay — the key bridge between durable execution and probabilistic LLMs. |
| Restate / DBOS | Journals + stored step results. OpenAI’s Agents SDK docs point to DBOS for long-running agents, human-in-the-loop flows, and preserved progress across failures/restarts. | Same determinism discipline at the step boundary. |
Caveat (Diagrid): plain checkpointing ≠ full durable execution. Checkpoints give you a save point but leave detection, coordination, and dedup to the developer. Production-grade fault tolerance usually needs a true durable-execution runtime, not just snapshots.
8. Headline numbers worth citing (Anthropic multi-agent system)
| Claim | Number |
|---|---|
| Multi-agent (Opus 4 lead + Sonnet 4 subagents) vs single-agent Opus 4 on internal research evals | +90.2% (esp. breadth-first / parallel queries) |
| Share of performance variance explained by token usage alone on BrowseComp | ~80% (model choice + tool-call count add ~15%) |
| Token cost of multi-agent vs chat | ~15× (and ~4× a standard chat turn) → reserve for high-value tasks |
| Research-time reduction from parallel tool calling on complex queries | up to 90% (minutes vs hours) |
The takeaway: multi-agent buys large quality gains on parallelizable, breadth-first work, but at a steep token cost — which is exactly why the reliability, eval, and observability discipline in this doc matters. You’re spending 15× the tokens; you’d better be able to tell whether it worked.
Sources
- Anthropic — How we built our multi-agent research system — https://www.anthropic.com/engineering/built-multi-agent-research-system
- Why Do Multi-Agent LLM Systems Fail? (MAST, arXiv 2503.13657) — https://arxiv.org/abs/2503.13657 · HTML: https://arxiv.org/html/2503.13657v1
- Where LLM Agents Fail and How They Can Learn From Failures (arXiv 2509.25370) — https://arxiv.org/pdf/2509.25370
- Characterizing Faults in Agentic AI: A Taxonomy — https://arxiv.org/html/2603.06847v1
- Looking Forward: Challenges & Opportunities in Agentic AI Reliability (arXiv 2511.11921) — https://arxiv.org/pdf/2511.11921
- The Compounding Errors Problem (Zartis) — https://www.zartis.com/the-compounding-errors-problem-why-multi-agent-systems-fail-and-the-architecture-that-fixes-it/
- ToolScan: Benchmark for Errors in Tool-Use LLMs (arXiv 2411.13547) — https://arxiv.org/pdf/2411.13547
- Reducing Tool Hallucination via Reliability Alignment (arXiv 2412.04141) — https://arxiv.org/pdf/2412.04141
- The unauthorized tool call problem (Answer.AI) — https://www.answer.ai/posts/2026-01-20-toolcalling.html
- AI Agent Hallucinations: Causes, Types, Tool Errors — https://manveerc.substack.com/p/ai-agent-hallucinations-prevention
- LangChain — LLM Evaluation Framework: Trajectories vs Outputs — https://www.langchain.com/resources/llm-evaluation-framework
- When AIs Judge AIs: Agent-as-a-Judge (arXiv 2508.02994) — https://arxiv.org/pdf/2508.02994
- Catching One in Five: LLM-as-Judge Blind Spots (arXiv 2606.10315) — https://arxiv.org/html/2606.10315
- Confident AI — LLM Agent Evaluation Metrics 2026 — https://www.confident-ai.com/blog/llm-agent-evaluation-complete-guide
- OpenTelemetry — GenAI agent & framework spans (semconv) — https://opentelemetry.io/docs/specs/semconv/gen-ai/gen-ai-agent-spans/
- OpenTelemetry blog — Inside the LLM Call: GenAI Observability with OTel — https://opentelemetry.io/blog/2026/genai-observability/
- OTel GenAI agentic systems semconv issue #2664 — https://github.com/open-telemetry/semantic-conventions/issues/2664
- Datadog — native OTel GenAI semantic conventions — https://www.datadoghq.com/blog/llm-otel-semantic-convention/
- Greptime — How OpenTelemetry Traces LLM Calls, Agent Reasoning, MCP Tools — https://greptime.com/blogs/2026-05-09-opentelemetry-genai-semantic-conventions
- Microsoft Learn — Agent tracing in Microsoft Foundry — https://learn.microsoft.com/en-us/azure/foundry/observability/concepts/trace-agent-concept
- Latitude — Best AI Agent Observability Tools 2026 — https://latitude.so/blog/best-ai-agent-observability-tools-2026-comparison
- Langfuse — Phoenix/Arize alternatives comparison — https://langfuse.com/faq/all/best-phoenix-arize-alternatives
- digitalapplied — Agent Observability: LangSmith, Langfuse, Arize 2026 — https://www.digitalapplied.com/blog/agent-observability-platforms-langsmith-langfuse-arize-2026
- Pydantic — AI Observability Pricing Compared — https://pydantic.dev/articles/ai-observability-pricing-comparison
- OpenAI — A Practical Guide to Building Agents (PDF) — https://cdn.openai.com/business-guides-and-resources/a-practical-guide-to-building-agents.pdf
- OpenAI — Guardrails and human review — https://developers.openai.com/api/docs/guides/agents/guardrails-approvals
- OpenAI — Safety in building agents — https://developers.openai.com/api/docs/guides/agent-builder-safety/
- Domino.ai — Composable guardrails for agentic AI — https://domino.ai/resources/blueprints/composable-guardrails-for-agentic-ai
- ToolHalla — AI Agent Guardrails & Output Validation 2026 — https://toolhalla.ai/blog/ai-agent-guardrails-io-validation-2026
- Thinking Machines Lab — Defeating Nondeterminism in LLM Inference — https://thinkingmachines.ai/blog/defeating-nondeterminism-in-llm-inference/
- Understanding/Mitigating Numerical Nondeterminism in LLM Inference (arXiv 2506.09501) — https://arxiv.org/pdf/2506.09501
- LLM Determinism in Prod: Temperature, Seeds, Replayable Results — https://medium.com/@2nick2patel2/llm-determinism-in-prod-temperature-seeds-and-replayable-results-8f3797583eb1
- Trust at Scale: Regression Testing Multi-Agent Systems — https://bhargavaparv.medium.com/trust-at-scale-regression-testing-multi-agent-systems-in-continuous-deployment-environments-99dfcc5872e9
- LangGraph — Durable execution docs — https://docs.langchain.com/oss/python/langgraph/durable-execution
- LangGraph vs Temporal for AI Agents (durable execution) — https://medium.com/data-science-collective/langgraph-vs-temporal-for-ai-agents-durable-execution-architecture-beyond-for-loops-a1f640d35f02
- Diagrid — Checkpoints Are Not Durable Execution — https://www.diagrid.io/blog/checkpoints-are-not-durable-execution-why-langgraph-crewai-google-adk-and-others-fall-short-for-production-agent-workflows
- Zylos Research — Durable Execution Patterns for AI Agents — https://zylos.ai/research/2026-02-17-durable-execution-ai-agents