Agent Orchestration & Multi-Agent Systems — Suite Index

By Pritesh Yadav June 16, 2026 10 min read —

Entry point and reading map for an 8-part deep dive on building, running, and paying for multi-agent LLM systems.

Research/reference doc · 2026-06-16 · part of the Agent Orchestration suite

Executive summary

Agent orchestration is the discipline of coordinating one or more LLM-driven agents — each running a perceive → plan → act → observe loop over tools — to accomplish a task that a single model call cannot reliably finish on its own. “Multi-agent” specifically means splitting that work across several agents (an orchestrator plus workers, a supervisor tree, a debate, a handoff swarm) rather than one agent grinding through every step.

The core insight that runs through every part of this suite:

Add agentic complexity only when it pays for itself. Climb the ladder one rung at a time — a single well-prompted LLM call beats a workflow, a workflow beats an agent, and a single agent beats a multi-agent system — until the task’s structure forces you up. Most production “AI features” should be a prompt or a fixed workflow, not an autonomous agent, and almost never a swarm.

The single biggest tradeoff you are buying when you go multi-agent:

Roughly a 15× token cost versus a single chat turn (Anthropic measured ~4× for a single agent and ~15× for their multi-agent research system). You only recoup that when the work is high-value and parallelizable — breadth-first research, fan-out over many independent sources, sweeps with disjoint ownership. For interdependent reasoning under a fixed budget (most coding, most tightly-coupled planning), a single agent wins on both cost and quality, because splitting it just multiplies coordination failures.

Everything else in the suite is the detail behind those two sentences: which pattern fits which task (02), how agents talk to each other and to tools (03), what framework to reach for (04), how to keep a finite context window useful across a long run (05), how to make a non-deterministic long-horizon system reliable and observable (06), how to control the cost (07), and how all of it lands on Print-Flow-360’s actual code (08).

Read in this order

#	File	What it covers
01	Foundations: Agents, Agentic Loops & When to go Multi-Agent	Vocabulary + decision framework: the three-rung ladder (call → workflow → agent), the augmented LLM, the ReAct loop, the four “agentic” properties, Anthropic’s five workflow patterns, and the single-vs-multi gate with measured costs. Start here.
02	Orchestration Patterns & Topologies	Eleven patterns along the determinism→autonomy axis — chaining, routing, parallelization, orchestrator-worker, evaluator-optimizer, supervisor trees, blackboard, handoff swarms, debate, map-reduce, reflection loops — each with an ASCII topology, when-to-use/when-not, and failure modes.
03	Communication, Coordination & Interop Protocols	How agents coordinate: message-passing vs shared-memory blackboards vs LangGraph state, handoff semantics, task decomposition + result aggregation, and the open standards MCP (agent↔tools) and A2A (agent↔agent), plus structured-output messaging pitfalls.
04	Frameworks & SDKs Landscape	Nine frameworks (LangGraph, CrewAI, AutoGen/AG2, OpenAI Agents SDK, Claude Agent SDK, Google ADK, Semantic Kernel, LlamaIndex) on an explicit→emergent control spectrum, with “use it when / avoid when” and a selection algorithm. Opens with “do you even need a framework?“
05	Context Engineering & Memory for Agents	Managing the finite attention budget: reduce / offload / retrieve / isolate, context rot & lost-in-the-middle, memory tiers (short/long, semantic/episodic/procedural), RAG vs agentic retrieval, sub-agent isolation, and prompt-caching mechanics.
06	Reliability, Evaluation & Observability	Why compounding per-step error breaks long runs, the MAST 14-failure-mode taxonomy, layered guardrails, outcome vs trajectory evals + LLM-as-judge, OpenTelemetry GenAI tracing, non-determinism testing, and resumable/durable execution.
07	Cost, Performance & Economics	The ~15× multiplier in detail, token spend as ~80% of performance variance, when multi-agent pays off, and the control levers: model tiering, prompt caching (90% input discount), Batch API (50% off), rate-limit management, and runtime budget guardrails.
08	Applied: Agent Orchestration in Print-Flow-360	How the patterns map onto PF360’s real AI code (`AiTaskRunner` + 7 consumers), the four hard constraints (no tool-use, no caching, no live-catalog injection, `QUEUE=sync`), and four realistic-but-unbuilt orchestration opportunities. Research-only.

Cheat sheet (one page)

Decision flow — single call → workflow → single agent → multi-agent

Start: what is the task?
  │
  ├─ Can one well-crafted prompt do it reliably?           → PLAIN LLM CALL.  Stop.
  │     (classify, extract, rewrite, summarize, draft)
  │
  ├─ Are the steps known and fixed in advance?             → WORKFLOW.  Stop.
  │     (chain / route / parallelize / evaluator-optimizer
  │      on a predetermined graph — deterministic control flow)
  │
  ├─ Does the path depend on intermediate results,         → SINGLE AGENT (one ReAct loop, tools).
  │  needing dynamic tool use & self-correction?              Stop here unless a gate below trips.
  │
  └─ Do ALL of these hold?                                 → MULTI-AGENT (orchestrator-worker etc.)
        • the work is HIGH-VALUE (worth ~15× tokens)
        • it is genuinely PARALLELIZABLE (independent subtasks)
        • a single context window can't hold it (overload)
        • subtasks have clean, disjoint ownership
      If any is NO → stay with a single agent.

Two gates justify the jump to multi-agent, and you need both:

Economic gate — the output is valuable enough to absorb a ~15× token bill.
Overload / parallelism gate — one agent’s context or serial latency is the bottleneck, and subtasks are independent enough to run side by side without constant coordination.

Pattern picker (condensed — full version in part 02)

Pattern	Parallel?	Control	Relative cost	Best for
Prompt chaining	No	Deterministic	Low	Fixed multi-step transforms (outline→draft→polish)
Routing	No	Deterministic	Low	Classify input, dispatch to a specialized handler
Parallelization (sectioning / voting)	Yes	Deterministic	Medium	Independent subtasks, or N votes for reliability
Orchestrator-worker	Yes	Dynamic	High (~15×)	Breadth-first research, dynamic fan-out over sources
Evaluator-optimizer	No	Loop	Medium	Output has clear quality criteria; iterate to a bar
Supervisor tree	Yes	Hierarchical	High	Large tasks needing layered delegation
Blackboard / shared state	Yes	Emergent	High	Many agents collaborating on one evolving artifact
Handoff / swarm	No	Decentralized	Medium-High	Specialist takes full control of a phase, then hands off
Debate / ensemble	Yes	Voting	High	High-stakes judgment where diversity beats one opinion
Map-reduce	Yes	Deterministic	Scales w/ corpus	Summarize / extract over a large document set
Reflection loop (ReAct / Reflexion)	No	Loop	Medium	Single agent self-correcting until done

Rule of thumb: prefer the lowest row that fits, and prefer deterministic control over emergent. Reach for the high-cost rows only when both gates above are satisfied.

Top 8 do’s

Start at the simplest rung that could plausibly work; add complexity only when you’ve watched the simpler version fail for a structural reason.
Make control flow as deterministic as the task allows — a workflow you can read beats an agent you have to debug.
Tier your models — a strong orchestrator with cheap workers captures most of the quality at a fraction of the cost (part 07).
Engineer the context, not just the prompt — reduce / offload / retrieve / isolate to keep the window’s signal high (part 05).
Give subagents disjoint ownership and clear contracts so their outputs merge cleanly without conflict.
Instrument from day one — OpenTelemetry GenAI traces, trajectory + outcome evals, small high-signal eval sets (part 06).
Design for resume-not-restart — checkpoint state so a long run can recover from a failed step (part 06).
Exploit caching and batching — a stable cached prefix (90% input discount) and the Batch API (50% off) change the economics materially (part 07).

Top 8 don’ts

Don’t reach for multi-agent by default — it’s the most expensive and failure-prone option, not the impressive one.
Don’t split interdependent reasoning across agents — coordination overhead and context loss will cost you quality (coding is the classic trap).
Don’t let context rot — never assume more tokens = better; mid-context info degrades (“lost in the middle”).
Don’t share one mutable context across agents carelessly — context pollution and error blast radius compound silently.
Don’t ship without guardrails — input/action/output/permission checks and human-in-the-loop on irreversible actions.
Don’t treat a multi-agent system as deterministic in tests — design for non-determinism and assert on outcomes/invariants.
Don’t pick a heavyweight framework before you need one — Anthropic’s own advice: start without one (parts 01, 04).
Don’t ignore the bill — token spend is ~80% of performance variance; a runaway loop is a runaway invoice (part 07).

How this maps to Print-Flow-360

If you’re here to ship something in this codebase rather than to study the field, read part 08 after part 01. It grounds every abstract pattern in PF360’s actual AI surface: the single stateless AiTaskRunner wrapping per-tenant Claude Sonnet / OpenAI providers behind seven feature consumers, and the four hard constraints that shape what’s even possible today — no native tool-use, no prompt caching, no live-catalog injection, and QUEUE=sync forcing inline runs. It then sketches four realistic-but-unbuilt orchestration opportunities (orchestrator-worker product builder, content/SEO chaining, support triage, evaluator-optimizer design loop), each tied to a concrete code seam and gated on the same shared enablers (Redis async, tool-use, prompt caching).

Part 08 is explicitly research-only — nothing in this suite has been built into PF360. It also flags two correctness traps: image generation is stubbed, and the AI matcher docs describe intent, not shipped code.

Sources

This index summarizes the eight parts; each part carries its own full source list. The load-bearing primary sources across the suite:

Building effective agents — Anthropic — https://www.anthropic.com/research/building-effective-agents
How we built our multi-agent research system — Anthropic — https://www.anthropic.com/engineering/built-multi-agent-research-system
Effective context engineering for AI agents — Anthropic — https://www.anthropic.com/engineering/effective-context-engineering-for-ai-agents
Why Do Multi-Agent LLM Systems Fail? (MAST taxonomy) — Cemri et al. — https://arxiv.org/abs/2503.13657
ReAct: Synergizing Reasoning and Acting in Language Models — Yao et al. — https://arxiv.org/abs/2210.03629
Lost in the Middle: How Language Models Use Long Contexts — Liu et al. — https://arxiv.org/abs/2307.03172
Context Rot — Chroma Research — https://research.trychroma.com/context-rot
Model Context Protocol (MCP) — https://modelcontextprotocol.io/
Agent2Agent (A2A) Protocol — https://a2aproject.github.io/A2A/
OpenTelemetry GenAI semantic conventions — https://opentelemetry.io/docs/specs/semconv/gen-ai/
Prompt caching — Anthropic docs — https://docs.anthropic.com/en/docs/build-with-claude/prompt-caching
LangGraph documentation — https://langchain-ai.github.io/langgraph/
OpenAI Agents SDK — https://openai.github.io/openai-agents-python/

Continue reading

🤖 Multi-Agent LLM Systems

Foundations: Agents, Agentic Loops & When to go Multi-Agent

🤖 Multi-Agent LLM Systems

Context Engineering & Memory for Agents

🤖 Multi-Agent LLM Systems

Applied: Agent Orchestration in Print-Flow-360