Agent Architecture & Orchestration
So far you have probably thought of a language model as a thing you send text to and get text back from. That is true, but it is also the floor, not the ceiling. The moment you let a model do things in the world — search the web, run code, edit a file, send an email — and let it decide what to do and when, you have crossed from "calling a model" into "building an agent." This chapter is about that crossing: how multi-step, tool-using systems work, the named patterns engineers use to build them, and the hard-won practices that keep them from looping forever, lying about success, or burning your budget.
We will build up carefully. First the basic unit (a single model call), then the building block everyone composes from (the augmented LLM), then fixed workflows, then true agents, and finally multi-agent systems — climbing a ladder of power and cost. The single most important idea in the whole chapter, repeated by every serious practitioner: start with the simplest thing that works, and only add complexity when it measurably earns its place.
From a single call to an agent: the complexity ladder
Let us define the rungs of the ladder in plain words, because choosing the right rung is the first and most consequential decision you will make.
An LLM call is one prompt in, one response out. The model has no memory beyond what you put in the prompt, no access to the outside world, and takes a single step. This is the right tool for most tasks: summarize a document, classify an email as billing/technical/sales, rewrite a paragraph, extract fields from text. If a single good prompt (perhaps with a few examples) solves your problem, you are done — adding anything more only adds cost and latency.
An augmented LLM (Anthropic's term) is a plain model enhanced with three capabilities, and it is the universal building block of everything agentic:
- Retrieval — the ability to pull in outside knowledge (a model generates its own search query, then reads the result). Retrieval means "look it up" — a list of facts it did not have memorized.
- Tools — the ability to call external functions (a weather API, a database query, a file editor). Tools are the model's hands.
- Memory — the ability to carry state across turns (jot down what matters, re-read it later). Memory is the model's notepad.
Modern models actively drive all three: they decide what to search for, which tool to call, and what to remember. Get this one unit solid before you compose anything bigger.
A workflow orchestrates models and tools through predefined code paths that you, the developer, write. The control flow is fixed; the model just fills in the blanks at each step. Workflows are predictable, testable, cheaper, and faster — ideal when the task decomposes into known steps.
An agent is a system where the model itself dynamically directs its own process and tool usage — deciding what to do next, in what order, and when it is finished. The number of steps is not known in advance. Agents trade predictability, cost, and latency for flexibility on open-ended problems. Chip Huyen gives the broadest definition: an agent is anything that can perceive its environment and act upon it. The environment (a terminal, a browser, a codebase, the web) defines what is possible; the set of tools defines what the agent can actually do. In a slogan: capability = tools × planning ability.
| Rung | Who controls the steps | Steps known ahead? | Cost / latency | Use when… |
|---|---|---|---|---|
| Single LLM call | — | One step | Lowest | Summarize, classify, extract, rewrite |
| Augmented LLM | Model (within one turn) | One turn + a lookup/tool | Low | Need a fact or one external action |
| Workflow | You (fixed code) | Yes | Low–medium, predictable | Task decomposes into known steps |
| Agent | The model (a loop) | No | High, variable | Open-ended, step count unpredictable |
| Multi-agent | Lead model + workers | No | ~15× a chat | High-value, parallelizable, read-heavy |
Analogy: the recipe spectrum. A single LLM call is asking a cook one question ("how long to boil an egg?"). A workflow is a recipe card — you wrote the exact steps and the cook executes each one. An agent is handing the cook a goal ("make dinner with what's in the fridge") and letting them decide the steps: more flexible, but you no longer control exactly what happens. A workflow is a train on rails you laid down; an agent is a taxi you give a destination — it picks the route live, handles detours, but costs more and might take a wrong turn.
Tools and function calling: how a model reaches outside itself
A model on its own can only produce text. Tools (also called function calling) are the mechanism that lets it act. Here is the crucial mental model that trips up beginners: the model never runs any code itself. It only emits a structured request to call a function; your application executes the real function and feeds the result back. This separation is exactly what keeps the system controllable and safe — you decide which functions exist and what arguments are allowed.
You describe each tool with a tool definition: a name, a natural-language description of when and how to use it, and a JSON-Schema (a formal description of the expected parameters and their types) for its arguments. Here is the lifecycle of a single tool turn:
1. You send: user message + list of tool definitions ──► model
2. Model decides a tool is needed, returns a structured tool-call:
{ "name": "get_weather", "input": {"city": "Paris"}, "id": "toolu_01" }
3. YOUR code parses the JSON, VALIDATES the arguments, runs the real function
4. You send back a tool_result referencing the same id:
{ "tool_use_id": "toolu_01", "content": "{ \"rain\": true }" }
5. Model uses that result to answer ── or to call another tool
Concretely, define a tool as {name:'get_weather', description:'Get current weather for a city', parameters:{type:'object', properties:{city:{type:'string'}}, required:['city']}}. The user asks "Do I need an umbrella in Paris?" The model emits get_weather({"city":"Paris"}); your code calls the weather API; you return {rain: true}; the model replies "Yes, bring an umbrella."
Two mechanics matter as systems grow. Each tool-call carries a unique id; when you return the output you reference that same id, so the model knows which call a result answers — essential when several tools run in parallel in one turn (parallel tool calls). And a setting called tool_choice lets you constrain behavior: auto (model decides), required (it must call some tool), or forcing one specific tool. Some providers also offer strict / structured outputs, a mode that forces the model's arguments to conform exactly to your schema, improving reliability.
One more distinction worth knowing: client-side tools (your functions, a bash shell, a file editor) run in your app — you execute and return results. Server-side tools (hosted web search, code execution offered by the provider) run their own inner loop on the provider's infrastructure — a single request from you can trigger several searches before you get a response, with your app not participating in each step.
Analogy: a chef and a waiter. The model (chef) writes the order ticket ("table 4: get_weather, city=Paris") but never leaves the kitchen. The waiter (your code) carries it out, does the real work, and brings back the result. The chef only ever sees tickets and results — never the dining room. That is why "the model never executes the function itself."
The agent loop and the ReAct pattern
The heartbeat of every agent is a simple loop: perceive → reason → act → observe, then repeat. The agent reads input, decides on an action, calls a tool, reads the result, and uses that result to decide the next action. The magic is that it gets ground truth from the environment at every step (real tool outputs, real code results), which lets it self-correct instead of flying blind.
The foundational, named version of this is ReAct (Reason + Act), introduced by Yao et al. in 2022. ReAct interleaves three repeating elements:
- Thought — the model reasons about the current state and plans the next move.
- Action — the model calls a tool.
- Observation — the model reads the result the environment returns.
A ReAct trajectory looks like this:
┌──────────────────────────────────────────────┐
│ │
▼ │
THOUGHT ──► ACTION ──► OBSERVATION ──────────┘ (loop)
"I need search('Acme "$4.2B"
Acme's 2024 revenue')
revenue"
...
THOUGHT: "I have enough." ──► ACTION: finish('answer') (exit)
Exit when: model says done (stop_reason = “end_turn”)
OR max_iterations / max_tokens / timeout hit
Why interleave reasoning with acting? Because the observation keeps the model honest. Pure "chain-of-thought" (reasoning step by step from memory, eyes closed) suffers from hallucination and error propagation — one wrong turn early and the model is lost, confidently building on a false fact. ReAct opens the model's eyes at each step, grounding the reasoning in real data and cutting both problems.
In code, the loop (Anthropic's framing) is: keep calling the model while the model's stop_reason is "tool_use". Each time it asks for a tool, execute it, append a tool_result to the conversation, and call again. The loop exits when stop_reason becomes "end_turn" (final answer), or max_tokens, stop_sequence, or refusal. A max_iterations cap is the safety valve that guarantees the loop cannot run forever.
messages = [user_message]
for step in range(MAX_ITERATIONS): # hard stop — never unbounded
resp = model(messages, tools=TOOLS)
if resp.stop_reason != "tool_use":
return resp # final answer
for call in resp.tool_calls:
try:
result = run_tool(call.name, validate(call.input))
except ToolError as e:
result = f"Error: {e}. Try an absolute path." # feed error BACK
messages.append(tool_result(call.id, result))
raise Budget("hit max iterations — escalate to a human")
Analogy: the agent loop is a detective working a case. Think (form a hypothesis), act (check a clue / interview someone), observe (what you learned), then update the theory — each real observation keeps the detective from spinning fiction.
Workflow patterns: fixed code paths that do a lot
Before reaching for autonomy, know that a huge fraction of real systems are built from a handful of fixed patterns where you write the control flow. Anthropic names five. They compose freely — a router can sit in front of chains; an orchestrator can spawn parallel workers.
1. Prompt chaining
Decompose a task into a fixed sequence of calls, each consuming the previous output, with optional programmatic gates (validation checkpoints) between steps. Example: write ad copy → (gate: is the length and tone right?) → translate it. Each call is simpler and more focused, so quality beats one giant prompt. You trade a little latency for a lot of accuracy.
2. Routing
A first model (or classifier) labels the input, then your code dispatches it to a specialized prompt, model, or flow. Example: a customer message is classified as "general question," "refund request," or "technical issue" and sent to a purpose-built handler. A common cost trick is model routing: trivial questions go to a cheap fast model (Haiku-class), hard ones escalate to a stronger model (Sonnet-class). The catch: the whole system's quality is capped by classification accuracy — a misroute silently poisons everything downstream.
3. Parallelization
Run model work concurrently in two flavors. Sectioning splits independent subtasks to run at once (e.g., one model answers the user while a second runs a safety guardrail on the same input). Voting runs the same task several times and aggregates — e.g., three independent "find security vulnerabilities" passes, flagging the file if any pass raises a concern (high recall) or requiring agreement (high precision). You trade cost for confidence or coverage.
4. Orchestrator-workers
A lead model reads the input and dynamically decides what subtasks are needed — not fixed in advance — spawns a worker per subtask, then synthesizes the results. The difference from parallelization is that the number and nature of subtasks are determined at runtime. The classic example: a coding change that must touch an unknown set of files. The orchestrator inspects the repo, decides which files need edits, dispatches a worker per file, then assembles the diff — you could not have hardcoded the file list.
5. Evaluator-optimizer
A generator model produces a candidate; an evaluator model critiques it against criteria and gives concrete feedback; the generator revises; loop until it passes or a limit hits. It pays off when (a) articulating feedback would clearly improve the output, and (b) the model can produce that feedback itself. Example: literary translation refined over rounds for rhythm and imagery.
Analogy: routing is a hospital triage nurse — one quick assessment sends you to cardiology, orthopedics, or the pharmacy. If the nurse misjudges, you end up in the wrong department, which is why misclassification poisons the whole system. Orchestrator-workers is a newsroom editor assigning stories on the spot to reporters, then stitching their pieces into one article.
Making agents reliable: planning, reflection, termination, and compounding error
Autonomy introduces the defining reliability problem of multi-step systems: compounding error. Per-step reliability multiplies. A 20-step task at 95% reliability per step succeeds only 0.95²⁰ ≈ 36% of the time. Push each step to 99% and you get 0.99²⁰ ≈ 82% — more than doubling end-to-end success. So the levers that matter most are: fewer steps, verifying intermediate results, and isolating work so one bad step cannot poison the rest.
Plan/execute separation (Huyen) is the first defense: generate a plan, validate it cheaply before spending real tool calls (heuristics — does it use only real tools? is it under the step limit? — or an LLM judge), and only then execute. This avoids the failure where an unchecked agent burns hours and API dollars chasing a bad plan. Generator, validator, and executor often become separate components.
Reflection / self-critique happens at multiple points: after the query (is this feasible?), after planning (is the plan valid?), after each step (did that work?), and at the end (did I actually achieve the goal?). The named framework here is Reflexion (Shinn et al., 2023), sometimes called "verbal reinforcement learning" because the agent improves without any weight updates. It has three roles: an Actor (does the ReAct-style task), an Evaluator (scores the trajectory, even just pass/fail), and a Self-Reflection module that, on failure, writes a natural-language post-mortem and prepends it to the context on the next try. Example: attempt 1 fails the unit tests; self-reflection writes "I assumed the list was sorted; it wasn't — sort before binary search"; attempt 2 reads that note and passes. No retraining occurred; the fix lived entirely in context. Reflection costs tokens, so it is a latency/quality tradeoff.
Termination is non-negotiable. Enforce multiple independent stop conditions: max iterations, max total tokens, a wall-clock timeout, and a per-turn retry budget. Without them, a misconfigured tool can turn a $0.40 request into a $5+ runaway, or an agent can retry the same broken call dozens of times. On exhaustion, return a graceful message and escalate — never a raw stack trace.
Finally, errors are normal — feed them back. When a tool fails, catch it and return a readable error to the model (Error: No file at './data.csv'. Use an absolute path: /workspace/data/data.csv) rather than throwing an exception that kills the loop. The model reads it and adapts. Pair the model's adaptability with deterministic safeguards: retries with exponential backoff and jitter (wait progressively longer, plus a random offset so many callers do not synchronize), per-attempt timeouts, and — critically — only auto-retry idempotent operations (ones safe to repeat). A retried payment without an idempotency key (a token the backend uses to recognize a duplicate and return the existing result) creates duplicate charges.
Tool design and context engineering
Anthropic reports spending more time optimizing the agent-computer interface (ACI — the design of the tools: their names, parameters, and docs) than the top-level prompt. Tool design is often more impactful than prompt tuning. Write each tool description like a great docstring for a brand-new junior developer: clear name, unambiguous parameters, example usage, edge cases, units, and explicit boundaries against other tools.
The most powerful idea here is poka-yoke — a manufacturing term for "mistake-proofing," designing so a mistake is structurally impossible. Anthropic's SWE-bench coding agent kept making path errors with relative file paths, so they changed the tool to require absolute paths — eliminating the entire error class by construction rather than relying on the model to remember.
Long-running agents also hit a finite resource: the context window (the model's working memory of the conversation). As it fills, recall accuracy degrades — a phenomenon called context rot. Four techniques manage it:
- Compaction — when nearing the limit, ask the model to summarize the history (preserving decisions and open questions, discarding redundant tool output) and continue from the condensed version. Claude Code does exactly this for long sessions.
- Structured note-taking / external memory — the agent writes durable notes (progress tallies, maps, learned facts) to storage outside the context window and re-reads them later, giving effectively unbounded memory at low token cost. Claude playing Pokémon keeps region maps and tallies across thousands of steps this way.
- Just-in-time retrieval — carry lightweight references (file paths, URLs, query strings) and fetch the actual content only when needed (run
head/tailor a targeted query), instead of stuffing everything in up front. - Sub-agents with clean context — delegate to a worker with its own fresh window that returns only a short summary.
Analogy: the context window is a whiteboard, not a hard drive. It is small and gets messy as you scribble; old notes smudge (context rot). Compaction is wiping it and rewriting only the important bits; external memory is copying key notes into a notebook you can re-open later. And least-privilege tools are like giving a contractor one room's key, not the master key — if that key is stolen (a prompt injection tricks the agent), the blast radius is one room, not the building.
Multi-agent systems, handoffs, and protocols
The top rung is multi-agent: a lead/orchestrator agent that plans, spawns parallel subagents (each with its own clean context window and a bounded task), and synthesizes their short summaries. Anthropic's production research system (LeadResearcher + 3–5 parallel subagents + a separate CitationAgent) beat single-agent Opus by 90.2% on internal research evals — but burns roughly 15× the tokens of a chat (versus ~4× for a single agent). It wins on breadth-first, parallelizable, read-heavy search ("find all board members across S&P 500 IT companies"). It hurts on tasks needing tightly shared mutable context — like most coding, where subagents in separate windows cannot see each other's work and end up duplicating or contradicting.
There are two orchestration shapes (OpenAI's framing):
- Manager (agents-as-tools) — a central agent calls specialists as bounded tools, keeps control, and synthesizes one coherent answer. Use when results from several specialists must be combined into a single voice.
- Handoff — an agent transfers full conversation ownership to a peer specialist, who then owns the final response. Use for clean routing where a specialist should fully take over (a triage agent hands a ticket to billing, refund, or tech-support, each with its own tools and policies).
MANAGER (agents-as-tools) HANDOFF (relay)
┌───────────┐ ┌─────────┐
│ Manager │ │ Triage │
└─────┬─────┘ └────┬────┘
calls │ as tools hand │ off (ownership moves)
┌──────┼──────┐ ▼
▼ ▼ ▼ ┌──────────┐
[Sum] [Extract][Search] │ Billing │ ◄─ owns the reply now
└──────┼──────┘ └──────────┘
▼
Manager writes the ONE answer
Analogy: manager vs. handoff is a general contractor vs. a relay race. The contractor (manager) hires subs, collects their work, and presents the finished house himself. A relay (handoff) passes the baton — once you hand off, the next runner owns the race to the finish. And "15× tokens" is hiring a research team vs. asking one person: a team covers far more ground in parallel, but you pay every salary — so for a quick factual question, one person is the rational, cheaper choice.
Two open standards are emerging to standardize the plumbing. MCP (Model Context Protocol), Anthropic's open JSON-RPC client-server standard, is the "USB-C for AI" — it lets an agent discover and securely call external tools and data (a Postgres MCP server, a GitHub MCP server) through one uniform interface, instead of hand-coding bespoke glue for each. A2A (Agent2Agent), Google's protocol, sits a layer up: agents publish a capability "agent card" and delegate tasks to each other peer-to-peer. In short: MCP connects an agent to its tools; A2A connects agents to other agents.
Evaluation, guardrails, and observability
Because agents are non-deterministic (the same prompt can give different runs), you cannot ship them on vibes. Three disciplines make them production-worthy.
Evaluation. Start with end-to-end task success as a black box ("did we meet the user's goal?"), then do error analysis to find the first place things went wrong. A practical tool is the transition failure matrix: rows are the last good state, columns are where failure first appeared; counting failures per transition turns overwhelming complexity into a ranked to-do list. Use LLM-as-judge (an LLM scoring another's output against a rubric of factual accuracy, citation accuracy, completeness, source quality, and tool efficiency); Anthropic found a single prompt emitting a 0.0–1.0 score plus pass/fail more reliable than ensembles of specialized judges. You can start with just ~20 realistic queries — early changes often move scores enough to see in tiny samples. For state-changing agents, judge the final state, not whether it followed one "correct" path, because valid alternative solutions exist.
Guardrails are layered, defense-in-depth, and least-privilege. Input guardrails validate and sanitize incoming content on every turn (not just the first); output guardrails filter results before the user sees them; tool/runtime guardrails check each individual tool call. Run a safety check in parallel with the main response for low latency, or before it (blocking) to save tokens when input is clearly abusive. Scope each tool's permissions to the minimum needed (least privilege), so a single compromised tool's blast radius is small. And gate irreversible, high-stakes actions (sending money, deleting data, outbound email) behind a human-in-the-loop checkpoint — a pause for approval, which is both a safety control and a cheap insurance policy against compounding errors.
Observability. You cannot reproduce a non-deterministic failure without its full trace. Assign a trace ID at the entry point and propagate it through every model call, tool call, memory operation, and handoff, emitting a tree of spans (each one operation, annotated with tokens, latency, and cost). The OpenTelemetry GenAI semantic conventions are the emerging standard. Always print every tool call and its output so you can inspect them. Note that many in-process agent SDKs do not give you durable execution (checkpointing and resume-from-crash) — you must add it yourself, or a mid-run failure loses all progress.
Real-world use cases
- Coding agents (SWE-bench / Claude Code): environment is a terminal + file system; the agent reads code, reasons, edits, runs tests, observes, and loops until tests pass. Cited as a case where single strong agents beat multi-agent because of shared-context needs.
- Customer-support routing (workflow + handoffs): classify each ticket, send easy ones to a cheap model and hard ones to a stronger one, and hand off refunds/technical/sales to specialists with isolated tools and policies — predictable and auditable.
- Retrieval-augmented multi-hop Q&A (the ReAct origin): answering HotpotQA questions and fact-checking by interleaving reasoning with live Wikipedia API calls — the pattern that beat pure chain-of-thought on hallucination.
- Deep research (multi-agent): Anthropic's Claude Research orchestrator + parallel subagents + CitationAgent — the canonical breadth-first, parallelizable case, justified by its high value despite ~15× token cost.
- Computer-use agents: Claude clicking, typing, and reading a GUI in a perceive→act→observe loop to complete browser/desktop tasks.
- Multi-tool analytical agents (Chameleon): a model given ~13 tools (search, calculators, a code interpreter, image captioners) improved ScienceQA accuracy by ~11 points — concrete evidence that capability scales with the tool set.
Section summary
- Climb the ladder deliberately: single call → augmented LLM (retrieval + tools + memory) → workflow → agent → multi-agent. Use the simplest thing that works and add autonomy only when it demonstrably pays off.
- Tools are how a model acts, but it never runs them itself — it emits a structured request that your validated code executes and returns. Tool/interface design (ACI, poka-yoke) often beats prompt tuning.
- The agent loop is ReAct: Thought → Action → Observation, repeated, with the environment's observation grounding the reasoning and curbing hallucination.
- Five workflow patterns — prompt chaining, routing, parallelization, orchestrator-workers, evaluator-optimizer — solve most multi-step needs predictably and cheaply.
- Compounding error is the core reliability problem (0.95²⁰ ≈ 36%); fight it with fewer steps, plan-then-validate, reflection (Reflexion), verification, and hard termination limits.
- Manage context as a finite, degrading resource via compaction, external note-taking, just-in-time retrieval, and clean-context sub-agents.
- Multi-agent buys parallelism at ~15× tokens — worth it for breadth-first read-heavy research, poison for shared-context coding. Choose manager vs. handoff by whether you need one synthesized answer or a clean takeover; standardize with MCP (tools) and A2A (agents).
- Production agents need evaluation, layered least-privilege guardrails with human checkpoints, and full tracing — because non-determinism makes untraced failures irreproducible.