Reliability, Evaluation & Observability

By Pritesh Yadav June 16, 2026 19 min read —

How to keep long-running, multi-agent LLM systems from quietly going off the rails — and how to know when they do.

Research/reference doc · 2026-06-16 · part of the Agent Orchestration suite

1. Why agents break the old reliability model

A single prompt→response call is easy to reason about: one input, one output, fail or succeed. The moment an agent starts to plan, call tools, hold state, and talk to other agents, you inherit every classic distributed-systems failure class — race conditions, partial failures, inconsistent state, cascading errors, deadlock — and then add a probabilistic engine on top of it.

The long-horizon cliff. Frontier models hit near-100% success on tasks a human could do in under ~4 minutes, but success falls below 10% on tasks needing more than ~4 hours of human-equivalent work. As you push agents into longer-horizon work, time-compounding failure modes become the dominant source of cost and unreliability — not raw model IQ.

The compounding-error math. If each step is independently 95% reliable, a 20-step trajectory is only 0.95²⁰ ≈ 36% reliable. Reliability decays geometrically with horizon length:

steps:        1     5     10     20     40
@95%/step:   95%   77%   60%    36%    13%
@99%/step:   99%   95%   90%    82%    67%

Early mistakes rarely stay local — they “cascade into subsequent steps, distorting reasoning, compounding misjudgments, and ultimately derailing the entire trajectory.” Once a cascade starts it is hard to reverse, so early detection and correction matter far more than post-hoc repair.

Anthropic’s framing: “Minor system failures can be catastrophic for agents.” Because agents are long-running and stateful, a small fault deep into a run wastes everything done before it.

Two design consequences flow from this whole section, and the rest of the doc elaborates them:

Cut horizon length and step count wherever you can (each step is a dice roll).
Detect-and-correct mid-run beats restart-from-scratch (see §7 resumability).

2. Failure modes

2a. MAST — the empirical taxonomy

The anchor reference is “Why Do Multi-Agent LLM Systems Fail?” (arXiv 2503.13657), the first empirically grounded multi-agent failure taxonomy. It was built from 1,600+ annotated execution traces across 7 popular MAS frameworks, covering coding/math/general-agent tasks on GPT-4, Claude 3, Qwen 2.5, and CodeLlama, validated by 6 expert annotators with inter-annotator agreement Cohen’s κ = 0.88. It defines 14 failure modes in 3 categories.

The central, uncomfortable finding: many failures stem from organizational / system design, not raw model capability. You can’t model-upgrade your way out of a bad orchestration design.

Category-level share of failures:

Category	Share
Specification & system design	≈ 41.8%
Inter-agent misalignment	≈ 36.9%
Task verification & termination	≈ 21% (remainder)

Cat	Failure mode	Definition
FC1 Specification / System Design	FM-1.1 Disobey task specification	Ignores stated constraints / requirements
	FM-1.2 Disobey role specification	Acts outside its assigned role boundaries
	FM-1.3 Step repetition	Needlessly redoes completed steps → delay/error
	FM-1.4 Loss of conversation history	Context truncated; reverts to an earlier state
	FM-1.5 Unaware of termination conditions	Doesn’t know when to stop (→ loops)
FC2 Inter-Agent Misalignment	FM-2.1 Conversation reset	Unwarranted restart losing context/progress
	FM-2.2 Fail to ask for clarification	Proceeds on ambiguous/unclear input
	FM-2.3 Task derailment	Drifts toward irrelevant/unproductive actions
	FM-2.4 Information withholding	Fails to share insight peers needed
	FM-2.5 Ignored other agent’s input	Disregards a collaborator’s recommendation
	FM-2.6 Reasoning-action mismatch	Reasoning diverges from the executed action
FC3 Task Verification / Termination	FM-3.1 Premature termination	Ends before objectives are met
	FM-3.2 No/incomplete verification	Skips checking outcomes → errors propagate
	FM-3.3 Incorrect verification	Validation done, but inadequately/wrongly

2b. Cascading subagent failures

“A misaligned agent produces an output that the next agent accepts as authoritative, and a verification failure that should have caught the problem is never triggered.” Multi-agent systems amplify error across independent agents rather than averaging it out. Anthropic notes that asynchronous orchestration would add parallelism but introduces fresh hazards: “challenges in result coordination, state consistency, and error propagation across the subagents.”

2c. Infinite loops / non-termination

Maps to MAST FM-1.5 (no termination awareness) and FM-1.3 (step repetition). Agents reinterpret goals differently and ping-pong; missing termination conditions are a top specification-class defect. Mitigation: explicit max-steps / max-cost budgets, loop detection (repeated identical tool calls / states), and forced termination.

2d. Tool misuse (“functional hallucination”)

Misusing tools/APIs: wrong tool, invalid arguments, or an unsafe plan. The function-calling / ToolScan literature breaks it into three categories:

Code	Error	Example
IFN	Incorrect Function Name	Calls a function not in the available set (a hallucinated tool)
IAN	Incorrect Argument Name	Hallucinated parameter names
IAV	Incorrect Argument Value	Wrong values, omitted required args, malformed/invalid JSON

Security angle: an agent told never to read secrets can hallucinate a read_secret capability — and if such a function happens to exist in the environment, the call executes. Hallucinated tool names can become real privilege escalation. Mitigations: strict JSON-Schema / Pydantic tool schemas, validate args before execution, fail closed on invalid args, and write precise tool descriptions with examples and constraints.

2e. Hallucinated coordination

Agents fabricate that a handoff or communication occurred, or act on outputs another agent never produced. Overlaps FM-2.6 (reasoning-action mismatch) and FM-2.4/2.5 (communication breakdowns), plus “format mismatches between agents” and “context loss during handoffs.” Treat every inter-agent handoff as an untrusted boundary: validate the payload shape and provenance, don’t assume the upstream agent did what it claimed.

2f. Deadlock / synchronous bottlenecks

Anthropic’s lead agent runs subagents synchronously, blocking until each finishes. This “simplifies coordination, but creates bottlenecks”: the lead can’t steer subagents mid-flight, subagents can’t coordinate with each other, and one slow subagent stalls the whole system. This is the practical head-of-line-blocking shape of “deadlock” in current production multi-agent systems — usually not a true lock cycle, but a slow dependency freezing the pipeline.

3. Guardrails

Guardrails = technical controls, validation layers, and policy enforcement that constrain agent behavior within safety/compliance/operational boundaries — block harmful outputs, block adversarial inputs, and keep agents within authorized scope.

Layered model

┌──────────────────────────────────────────────────────────┐
│ INPUT VALIDATION    context filtering, access control,     │
│                     prompt-injection / jailbreak detection,│
│                     relevance + blocklist + safety classify│
├──────────────────────────────────────────────────────────┤
│ ACTION GUARDRAILS   per tool call:                         │
│  (every tool call)   • authorization check                 │
│                      • schema validation on arguments      │
│                      • rate limit                          │
│                      • cost meter / budget ceiling         │
│                      • kill switch                         │
├──────────────────────────────────────────────────────────┤
│ OUTPUT VALIDATION   JSON-schema compliance BEFORE anything │
│                     downstream consumes the output         │
├──────────────────────────────────────────────────────────┤
│ PERMISSION GATING   least privilege per tool;              │
│                     act only within authorized scope       │
├──────────────────────────────────────────────────────────┤
│ HUMAN-IN-THE-LOOP   pause the run for approval when risk   │
│                     exceeds an automation threshold        │
└──────────────────────────────────────────────────────────┘

Input validation — block requests carrying sensitive data, detect prompt injection/jailbreaks, prevent connecting to unauthorized APIs; relevance/keyword/blocklist filters and safety classifiers.
Action-level guardrails — wrap every tool call with authorization, argument schema validation, rate limit, cost meter, and a kill switch. Whitelist API calls; put budget ceilings on LLM spend so a misconfigured agent can’t spiral.
Output validation — for structured outputs feeding downstream APIs, enforce JSON-schema compliance and validate before anything consumes it.
Human-in-the-loop checkpoints — inject human judgment when risk exceeds automation thresholds; require human approval for production changes and high-stakes actions.

Vendor guidance

OpenAI (A Practical Guide to Building Agents + Agents SDK): guardrails are functions or agents enforcing jailbreak prevention, relevance, keyword filtering, blocklists, and safety classification — “critical at every stage, from input filtering and tool use to human-in-the-loop intervention.” Human review pauses the run so a person/policy approves or rejects a sensitive action; this is especially important early in deployment to surface failures and edge cases and bootstrap the eval cycle. OpenAI’s Guardrails library ships an eval framework reporting precision, recall, F1 against labeled test data.
LangGraph has native tool scoping and human-in-the-loop primitives (interrupt / approve). Proxy layers (Invariant Gateway, agentgateway) centralize cross-agent policy enforcement.

PF360 house-rule tie-in (CLAUDE.md §0). If a guardrail or integration isn’t configured, never show a broken/empty/silent-failure state. Hide or disable the control, explain in plain language what’s missing, and link to where to set it up — e.g. “No safety classifier configured. Set one up → Settings › Guardrails.”

4. Evaluation

4a. Why exact-match fails for agents

Agents don’t follow predetermined paths. Anthropic: “Even with identical starting points, agents might take completely different valid paths to reach their goal.” Step-by-step exact validation is therefore impossible. Concretely: if the gold trajectory calls search_web but the agent calls search_documentation and both legitimately solve the task, exact match wrongly labels a success as a failure. Free-form text outputs with multiple valid answers also resist string equality. You need fuzzy / semantic matching.

4b. Outcome evals vs trajectory evals

	Outcome / end-state eval	Trajectory eval
Judges	The final result — “did the task succeed?”	The whole execution path: tools chosen, intermediate reasoning, turns
Strength	Tolerates many valid paths; Anthropic leans here for research agents and judges outcomes rather than processes	Catches failures masked by a correct-looking answer (e.g. skipped identity verification before a refund)
Weakness	An agent “can call every tool correctly and still fail the task,” and a correct answer can hide broken reasoning / lucky-but-wrong tool calls	Brittle if matched too strictly; needs a reference trajectory

Trajectory matching modes:

Exact match — tool calls + reasoning must match the reference exactly (strict, brittle).
In-order — required tools called in correct sequence; intermediate detail may vary.
Any-order — all required tools called, sequence ignored.

Three eval levels to run together:

end-to-end   →  did the task succeed?            (outcome)
trajectory   →  was the path efficient & sound?  (process)
component    →  which retriever/tool/subagent broke? (isolation)

Use deterministic checks for exact things (tool-name correctness, output format) and LLM-as-judge for anything depending on free-form output.

4c. LLM-as-judge

Anthropic’s implementation: a single LLM call scoring output against a rubric on factual accuracy, citation accuracy, completeness, source quality, tool efficiency — producing a 0.0–1.0 score plus a pass/fail grade. It scaled to “hundreds of outputs” and “aligned well with human judgment.”
Caveats from the literature: no judge excels everywhere. Mitigate with structured rubrics that constrain output, multiple judge passes with aggregation, calibration against human-labeled examples, and monitoring for consistency drift over time. Production studies of multi-turn transaction agents still find judges miss a meaningful fraction of failures (the “catching ~one in five” blind-spot framing). Judges are necessary but not sufficient.
Agent-as-judge is the emerging extension: use an agent to evaluate another agent’s full trajectory, not just its end result.

4d. Small, high-signal eval sets

Anthropic started with ~20 queries representing real usage. Early development moved success from 30% → 80% — an effect so large that a tiny set was enough to detect it. Start evaluating immediately; don’t wait for a large suite, and resist over-engineering eval infrastructure before you have any signal.

4e. Human evaluation remains essential

Automation misses subtle biases and edge cases. Anthropic’s concrete example: human testers caught that early agents “consistently chose SEO-optimized content farms over authoritative but less highly-ranked sources” — a quality bias no automated metric flagged. Wire a production feedback loop: failures → annotation queue → corrected trajectories → regression tests, so the same bug can’t reach users twice.

5. Observability & Tracing

5a. Why agent observability ≠ LLM monitoring

“Agent failures appear in multi-step causal chains, not at the individual-call level, and require full-session trace capture to detect.” Non-deterministic run-to-run behavior makes debugging hard; without tracing you can’t tell a bad search query from poor source selection from a tool failure. Anthropic uses full production tracing and monitors “decision patterns and interaction structures” while deliberately not reading individual conversation content — privacy-preserving root-cause analysis.

5b. OpenTelemetry GenAI semantic conventions (the emerging standard)

Driven by OTel’s GenAI SIG since April 2024, these standardize attribute names/types/enums for LLM calls, agent steps, vector-DB queries, token usage, cost, and quality. Status: most GenAI conventions are still experimental (not API-stable) as of early-mid 2026 — adopt them, but pin versions and expect churn.

Span tree:

invoke_agent                       (top-level agent run)
├── chat                           (one per LLM call)
│   └── execute_tool               (one per tool call)
├── chat
└── embeddings

Agent conventions extend/override the base GenAI span conventions and add concepts for tasks, actions, agents, teams, artifacts, and memory.

Key gen_ai.* attributes:

Attribute	Meaning
`gen_ai.operation.name`	`chat`, `invoke_agent`, `execute_tool`, `embeddings`
`gen_ai.request.model`	model requested (e.g. `gpt-4o`)
`gen_ai.usage.input_tokens` / `gen_ai.usage.output_tokens`	token counts
`gen_ai.response.finish_reasons`	why generation stopped (`stop`, `tool_calls`, …)
`gen_ai.agent.id`	agent identifier
`gen_ai.tool.name`	tool invoked
`gen_ai.token.type`	metric dimension distinguishing input vs output
`gen_ai.system_instructions` / `gen_ai.input.messages` / `gen_ai.output.messages`	prompt/completion content (only when content capture is enabled)

Metrics: gen_ai.client.operation.duration (latency histogram), gen_ai.client.token.usage (token histogram).

Microsoft + Cisco Outshift introduced multi-agent semantic conventions on top of OTel + W3C Trace Context, standardizing telemetry for quality, performance, safety, cost, tool invocations, and inter-agent collaboration. Microsoft Foundry / Azure Monitor App Insights trace LangChain, LangGraph, OpenAI Agents SDK, and Microsoft Agent Framework. Datadog natively supports the OTel GenAI conventions.

5c. Tooling landscape

Tool	Strength	OSS / self-host	Notes
LangSmith	LangChain/LangGraph-native: node-by-node state diffs, full execution graphs, model+tool breakdowns, replay against new model versions	No (managed)	Least friction on LangChain/LangGraph; 5,000 traces/mo free
Langfuse	Operational telemetry + cost analytics; framework-agnostic via OTel; ingests any LLM SDK/agent framework	Yes (Postgres + ClickHouse)	Leading OSS observability path
Arize Phoenix	Enterprise RAG eval & monitoring; agent trace capture, LLM-as-judge eval, RAG metrics, dataset mgmt; OpenTelemetry-native	Yes	OSS eval + observability
W&B Weave	Experimentation & prompt research	Partial	Best for experiment tracking

Pricing gotcha worth a callout. Billing per trace penalizes agent complexity: a 50-step autonomous agent costs the same as one LLM call if billed per span, but 50× more if billed per trace. Factor the billing unit into your architecture and vendor choice.

6. Testing & determinism challenges

LLMs are inherently probabilistic, and this defeats naive regression testing.

Even at temperature 0, output is not guaranteed identical run-to-run — numerical-precision effects under greedy decoding still cause drift. OpenAI states its API can only be “mostly deterministic” regardless of temperature; the seed parameter improves but does not guarantee reproducibility, because of system updates and load-balancing across different hardware (Thinking Machines Lab traces the root cause to numerical nondeterminism in inference).
The same agent config (prompt, tools, model, orchestration) can yield different outputs across invocations due to temperature sampling, model weight/version updates, tool-latency variation, and context-window effects. This makes reproducing a specific failure extraordinarily difficult.

Regression strategy for non-deterministic agents:

Run a replay regression suite as a CI gate first — “fast, cheap, and deterministic” (replay journaled tool/LLM results instead of re-calling live).
Aggregate over many sessions before comparing A/B arms. A single session can diverge within an arm; compare distributions, not point samples. Use the same seed policy across both arms.
Use deterministic assertions for exact things (tool-name correctness, schema/format) and LLM-judge for free-form quality.

Mantra: “Temperature defines how wild the model can be; seeds make that wildness replayable” — but seeds are a partial, not complete, solution.

7. Resumability, checkpointing & durable execution

The compounding-error math (§1) means restarting a failed long run is wasteful. The fix is to persist progress and resume from the failure point.

Anthropic’s production approach (multi-agent research system)

Don’t restart on failure — it’s “expensive and frustrating.” Resume from where the agent was when the errors occurred.
Combine AI adaptability with deterministic safeguards: retry logic + regular checkpoints, so agents recover gracefully when tools fail.
Context-window management for hundreds-of-turns runs: agents summarize completed phases, offload essential info to external memory, then spawn fresh subagents with clean contexts, handing off to maintain continuity before hitting limits.
Rainbow deployments: because the system runs almost continuously as a stateful web of prompts/tools, updates gradually shift traffic old→new while running both simultaneously, so in-flight agents aren’t disrupted.

Durable execution

Durable execution = save progress at key points so a process can pause and resume exactly where it left off; state is persisted after every logical step, so a crash resumes from the last checkpoint, not the start.

Framework	How it resumes	Watch out for
LangGraph	Persistence layer saves graph state as checkpoints organized into threads; a `StateSnapshot` at every super-step; `thread_id` is the primary key. Interrupt pauses, saves, waits indefinitely (enables human-in-the-loop), then resumes.	Checkpointers save state between nodes, not inside a node; the developer must detect/trigger/coordinate resumption.
Temporal	Replays an Event History (durable log) to reconstruct in-memory state after a crash; resumes at the exact failure step without re-running completed work.	Workflow code must be deterministic. Because LLM calls are non-deterministic, wrap them as “activity” steps whose results are journaled on first execution and never re-run on replay — the key bridge between durable execution and probabilistic LLMs.
Restate / DBOS	Journals + stored step results. OpenAI’s Agents SDK docs point to DBOS for long-running agents, human-in-the-loop flows, and preserved progress across failures/restarts.	Same determinism discipline at the step boundary.

Caveat (Diagrid): plain checkpointing ≠ full durable execution. Checkpoints give you a save point but leave detection, coordination, and dedup to the developer. Production-grade fault tolerance usually needs a true durable-execution runtime, not just snapshots.

8. Headline numbers worth citing (Anthropic multi-agent system)

Claim	Number
Multi-agent (Opus 4 lead + Sonnet 4 subagents) vs single-agent Opus 4 on internal research evals	+90.2% (esp. breadth-first / parallel queries)
Share of performance variance explained by token usage alone on BrowseComp	~80% (model choice + tool-call count add ~15%)
Token cost of multi-agent vs chat	~15× (and ~4× a standard chat turn) → reserve for high-value tasks
Research-time reduction from parallel tool calling on complex queries	up to 90% (minutes vs hours)

The takeaway: multi-agent buys large quality gains on parallelizable, breadth-first work, but at a steep token cost — which is exactly why the reliability, eval, and observability discipline in this doc matters. You’re spending 15× the tokens; you’d better be able to tell whether it worked.

Sources

Anthropic — How we built our multi-agent research system — https://www.anthropic.com/engineering/built-multi-agent-research-system
Why Do Multi-Agent LLM Systems Fail? (MAST, arXiv 2503.13657) — https://arxiv.org/abs/2503.13657 · HTML: https://arxiv.org/html/2503.13657v1
Where LLM Agents Fail and How They Can Learn From Failures (arXiv 2509.25370) — https://arxiv.org/pdf/2509.25370
Characterizing Faults in Agentic AI: A Taxonomy — https://arxiv.org/html/2603.06847v1
Looking Forward: Challenges & Opportunities in Agentic AI Reliability (arXiv 2511.11921) — https://arxiv.org/pdf/2511.11921
The Compounding Errors Problem (Zartis) — https://www.zartis.com/the-compounding-errors-problem-why-multi-agent-systems-fail-and-the-architecture-that-fixes-it/
ToolScan: Benchmark for Errors in Tool-Use LLMs (arXiv 2411.13547) — https://arxiv.org/pdf/2411.13547
Reducing Tool Hallucination via Reliability Alignment (arXiv 2412.04141) — https://arxiv.org/pdf/2412.04141
The unauthorized tool call problem (Answer.AI) — https://www.answer.ai/posts/2026-01-20-toolcalling.html
AI Agent Hallucinations: Causes, Types, Tool Errors — https://manveerc.substack.com/p/ai-agent-hallucinations-prevention
LangChain — LLM Evaluation Framework: Trajectories vs Outputs — https://www.langchain.com/resources/llm-evaluation-framework
When AIs Judge AIs: Agent-as-a-Judge (arXiv 2508.02994) — https://arxiv.org/pdf/2508.02994
Catching One in Five: LLM-as-Judge Blind Spots (arXiv 2606.10315) — https://arxiv.org/html/2606.10315
Confident AI — LLM Agent Evaluation Metrics 2026 — https://www.confident-ai.com/blog/llm-agent-evaluation-complete-guide
OpenTelemetry — GenAI agent & framework spans (semconv) — https://opentelemetry.io/docs/specs/semconv/gen-ai/gen-ai-agent-spans/
OpenTelemetry blog — Inside the LLM Call: GenAI Observability with OTel — https://opentelemetry.io/blog/2026/genai-observability/
OTel GenAI agentic systems semconv issue #2664 — https://github.com/open-telemetry/semantic-conventions/issues/2664
Datadog — native OTel GenAI semantic conventions — https://www.datadoghq.com/blog/llm-otel-semantic-convention/
Greptime — How OpenTelemetry Traces LLM Calls, Agent Reasoning, MCP Tools — https://greptime.com/blogs/2026-05-09-opentelemetry-genai-semantic-conventions
Microsoft Learn — Agent tracing in Microsoft Foundry — https://learn.microsoft.com/en-us/azure/foundry/observability/concepts/trace-agent-concept
Latitude — Best AI Agent Observability Tools 2026 — https://latitude.so/blog/best-ai-agent-observability-tools-2026-comparison
Langfuse — Phoenix/Arize alternatives comparison — https://langfuse.com/faq/all/best-phoenix-arize-alternatives
digitalapplied — Agent Observability: LangSmith, Langfuse, Arize 2026 — https://www.digitalapplied.com/blog/agent-observability-platforms-langsmith-langfuse-arize-2026
Pydantic — AI Observability Pricing Compared — https://pydantic.dev/articles/ai-observability-pricing-comparison
OpenAI — A Practical Guide to Building Agents (PDF) — https://cdn.openai.com/business-guides-and-resources/a-practical-guide-to-building-agents.pdf
OpenAI — Guardrails and human review — https://developers.openai.com/api/docs/guides/agents/guardrails-approvals
OpenAI — Safety in building agents — https://developers.openai.com/api/docs/guides/agent-builder-safety/
Domino.ai — Composable guardrails for agentic AI — https://domino.ai/resources/blueprints/composable-guardrails-for-agentic-ai
ToolHalla — AI Agent Guardrails & Output Validation 2026 — https://toolhalla.ai/blog/ai-agent-guardrails-io-validation-2026
Thinking Machines Lab — Defeating Nondeterminism in LLM Inference — https://thinkingmachines.ai/blog/defeating-nondeterminism-in-llm-inference/
Understanding/Mitigating Numerical Nondeterminism in LLM Inference (arXiv 2506.09501) — https://arxiv.org/pdf/2506.09501
LLM Determinism in Prod: Temperature, Seeds, Replayable Results — https://medium.com/@2nick2patel2/llm-determinism-in-prod-temperature-seeds-and-replayable-results-8f3797583eb1
Trust at Scale: Regression Testing Multi-Agent Systems — https://bhargavaparv.medium.com/trust-at-scale-regression-testing-multi-agent-systems-in-continuous-deployment-environments-99dfcc5872e9
LangGraph — Durable execution docs — https://docs.langchain.com/oss/python/langgraph/durable-execution
LangGraph vs Temporal for AI Agents (durable execution) — https://medium.com/data-science-collective/langgraph-vs-temporal-for-ai-agents-durable-execution-architecture-beyond-for-loops-a1f640d35f02
Diagrid — Checkpoints Are Not Durable Execution — https://www.diagrid.io/blog/checkpoints-are-not-durable-execution-why-langgraph-crewai-google-adk-and-others-fall-short-for-production-agent-workflows
Zylos Research — Durable Execution Patterns for AI Agents — https://zylos.ai/research/2026-02-17-durable-execution-ai-agents

Continue reading

🤖 Multi-Agent LLM Systems

Agent Orchestration & Multi-Agent Systems — Suite Index

🤖 Multi-Agent LLM Systems

Foundations: Agents, Agentic Loops & When to go Multi-Agent

🤖 Multi-Agent LLM Systems

Context Engineering & Memory for Agents