Agent Orchestration · drafting · 2026-05-23

Agent orchestration, and the patterns that earn their complexity

Working notes on multi-agent systems. Which patterns I keep reaching for, which ones I've burned myself on, and the primitives I think are still missing.

Most of my work right now lives between "single agent in a loop" and "actually orchestrated system." That gap is where the interesting trade-offs are. Adding a second agent is a tax. It costs tokens, latency, complexity, and a new failure surface. So the question I keep asking is: when does the second agent pay rent?

Here's what I think I know so far, plus what's still moving.

Patterns I keep reaching for

Parallel exploration. Three read-only agents searching different angles of the codebase, returning summaries to a planner agent. Cheap, fast, and the failure mode is bounded (they can't write). I use this for any "where is X defined" or "how is Y used" question that's bigger than a single grep.
Plan / review / execute splits. One agent plans, a second adversarially reviews the plan, a third executes. The reviewer catches the things the planner skipped because it was already attached to the answer. The cost is one extra turn; the payoff is catching nearly every "looks reasonable but assumes the wrong constraint" plan I would otherwise have shipped.
Durable orchestrators for anything that crosses a network boundary. If an agent step depends on an LLM call, an API call, or a long-running build, the orchestrator should be durable. Vercel Queues, Temporal, anything that resumes on crash. I've lost too many three-step workflows to a single 502 to keep running them in memory.
Specialist /skills invoked by a generalist. Instead of a giant prompt that knows everything, a generalist that knows when to call /qa, /ship, /investigate, /codex. The routing is the orchestration.
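
A minimal sketch of the parallel exploration fan-out, under stated assumptions: the `Explorer` type, the `exploreInParallel` name, and the result shape are all hypothetical, and the read-only guarantee is assumed to come from whatever tools each explorer is given, not from this code. What it shows is the bounded failure mode: one explorer throwing does not sink the others, and the planner still gets a labeled result for every angle.

```typescript
// Hypothetical signature: an explorer takes a question and returns a summary.
type Explorer = (question: string) => Promise<string>;

interface ExplorationResult {
  angle: string;
  summary: string | null; // null when the explorer failed
  error?: string;
}

// Fan out one question to several explorers and collect every outcome.
// Each explorer catches its own failure, so Promise.all never rejects here.
async function exploreInParallel(
  question: string,
  explorers: Record<string, Explorer>,
): Promise<ExplorationResult[]> {
  return Promise.all(
    Object.entries(explorers).map(
      async ([angle, run]): Promise<ExplorationResult> => {
        try {
          return { angle, summary: await run(question) };
        } catch (err) {
          return { angle, summary: null, error: String(err) };
        }
      },
    ),
  );
}
```

The planner then reads the summaries the way it would read input from a stranger: keep what is useful, drop the rest.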

Worked example: the plan/review/execute split

The pattern above sounds abstract, so here's a real run of it from this site. The task: surface per-run token counts and cost on every lab page. The catch: the labs stream plain text to a raw output panel, and the AI SDK only reports usage after the stream ends.

Turn 1: the plan agent

The planner got the relevant files and one instruction:

planner prompt (condensed)

Read lib/labs/use-lab-run.ts and one lab route. Propose how to
surface token usage through a plain text stream without switching
to a structured stream protocol. The client renders raw text.

It came back with a reasonable design: append a machine-readable trailer line after the markdown (an HTML comment carrying JSON), have the client parse it off after the stream ends, compute cost client-side from a pricing table.
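
Roughly what that design looks like, as a sketch and not the code that shipped: the marker string, the `Usage` field names, and the prices are placeholders I made up for illustration.

```typescript
// Hypothetical marker and field names; the real ones may differ.
const TRAILER_PREFIX = "<!--usage:";
const TRAILER_SUFFIX = "-->";

interface Usage {
  inputTokens: number;
  outputTokens: number;
}

// Server side: serialize usage into one trailer line appended after the markdown.
function formatTrailer(usage: Usage): string {
  return `\n${TRAILER_PREFIX}${JSON.stringify(usage)}${TRAILER_SUFFIX}`;
}

// Client side: after the stream ends, split visible body from parsed usage.
// The parsed JSON is trusted as-is; a real client would validate its shape.
function splitTrailer(fullText: string): { body: string; usage: Usage | null } {
  const start = fullText.lastIndexOf(TRAILER_PREFIX);
  if (start === -1 || !fullText.endsWith(TRAILER_SUFFIX)) {
    return { body: fullText, usage: null };
  }
  const json = fullText.slice(
    start + TRAILER_PREFIX.length,
    -TRAILER_SUFFIX.length,
  );
  try {
    const usage = JSON.parse(json) as Usage;
    // trimEnd drops the newline that formatTrailer added (and any other
    // trailing whitespace on the body).
    return { body: fullText.slice(0, start).trimEnd(), usage };
  } catch {
    return { body: fullText, usage: null };
  }
}

// Placeholder prices in USD per million tokens, not real rates.
const PRICING = { inputPerMTok: 1.0, outputPerMTok: 2.0 };

function estimateCostUsd(usage: Usage): number {
  return (
    (usage.inputTokens * PRICING.inputPerMTok +
      usage.outputTokens * PRICING.outputPerMTok) /
    1_000_000
  );
}
```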

Turn 2: the reviewer

A second agent, fresh context, no attachment to the design, got only the plan and the same two files. It caught the bug the planner missed: the output panel renders raw text as it streams, so the trailer, or a partial marker split across chunk boundaries, would flash on screen before the parse ran. The fix, buffering the stream and holding back any trailing partial-marker suffix, only exists because a reviewer with no memory of writing the plan read it like a stranger.
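
The hold-back logic is small enough to sketch. This is an illustration under assumptions, not the shipped code: `MARKER` and the `TrailerHoldback` class are hypothetical names. The idea is to render only text that cannot be the start of the marker, and to keep back the longest tail of the buffer that matches a prefix of it.

```typescript
// Hypothetical marker; the real one may differ.
const MARKER = "<!--usage:";

// Length of the longest suffix of `text` that is a proper prefix of `marker`.
function partialMarkerSuffixLength(text: string, marker: string): number {
  const max = Math.min(text.length, marker.length - 1);
  for (let len = max; len > 0; len--) {
    if (text.endsWith(marker.slice(0, len))) return len;
  }
  return 0;
}

class TrailerHoldback {
  private held = ""; // text not yet safe to show
  private trailerSeen = false;

  // Feed one chunk; returns the text that is safe to render right now.
  push(chunk: string): string {
    this.held += chunk;
    if (this.trailerSeen) return ""; // everything after the marker stays hidden

    const at = this.held.indexOf(MARKER);
    if (at !== -1) {
      this.trailerSeen = true;
      const visible = this.held.slice(0, at);
      this.held = this.held.slice(at);
      return visible;
    }

    // No full marker yet: hold back only a tail that could still become one.
    const keep = partialMarkerSuffixLength(this.held, MARKER);
    const cut = this.held.length - keep;
    const visible = this.held.slice(0, cut);
    this.held = this.held.slice(cut);
    return visible;
  }

  // Call once the stream ends. A held tail that never became a marker is
  // released as visible text; otherwise the raw trailer is returned for parsing.
  finish(): { visible: string; trailer: string | null } {
    const rest = this.held;
    this.held = "";
    return this.trailerSeen
      ? { visible: "", trailer: rest }
      : { visible: rest, trailer: null };
  }
}
```

One known limit of this sketch: output that legitimately contains the marker string would be treated as the trailer, so the marker has to be something the model is unlikely to emit.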

That's the whole trick. The planner was already attached to its answer. The reviewer wasn't.

Turn 3: execute

The executor followed the reviewed plan. The diff was boring, which is the point. You can see the result on any lab page: run one, and the token counts, estimated cost, and latency print under the output.

The cost of the split was one extra turn. What it bought was the class of bug that survives a "looks reasonable" self-review.
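
The three turns can be sketched as a plain pipeline. Everything named here is an assumption for illustration: the `Agent` type (prompt in, text out, fresh context on every call), the `planReviewExecute` function, and the prompt wording. The part that matters is what each agent is given: the reviewer sees the plan and the files, and nothing of how the planner got there.

```typescript
// Hypothetical agent signature; each call is assumed to start with fresh context.
type Agent = (prompt: string) => Promise<string>;

interface SplitResult {
  plan: string;
  review: string;
  result: string;
}

async function planReviewExecute(
  task: string,
  files: string,
  agents: { planner: Agent; reviewer: Agent; executor: Agent },
): Promise<SplitResult> {
  // Turn 1: the planner sees the task and the relevant files.
  const plan = await agents.planner(`Task: ${task}\n\nFiles:\n${files}`);

  // Turn 2: the reviewer gets only the plan and the same files,
  // none of the planner's conversation or reasoning.
  const review = await agents.reviewer(
    `Review this plan adversarially. List wrong assumptions.\n\n` +
      `Plan:\n${plan}\n\nFiles:\n${files}`,
  );

  // Turn 3: the executor follows the plan as amended by the review.
  const result = await agents.executor(
    `Execute the plan, applying the review.\n\n` +
      `Plan:\n${plan}\n\nReview:\n${review}`,
  );

  return { plan, review, result };
}
```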

Where these patterns are already running

These aren't just notes; most of them are deployed on this site or in the systems it documents:

Gateway as the orchestration boundary: Partner Enablement MCP puts policy, audit, and human approval between the agent and the write path. The orchestrator's control surface is the product.
Narrated multi-step planning: the Vibe Coding lab has an agent-orchestration mode that streams exactly this choreography.
PRD-as-plan handed to an executor: the Feature Delivery lab and the PRD as executable spec experiment are the same split, one turn earlier.
Visible tool loops: the Ask This Codebase lab runs a real multi-step agent over this site's own source with every tool call on screen.
Cost runaway (below) now has a shipped mitigation here: every lab prints its per-run token cost, so the number that used to surprise me at the end of the month is on the screen at the end of the run.

Failure modes I've actually hit

Context bleed. Sub-agent finishes, returns a 2,000-token summary, parent agent now has both its own context and the summary, and the summary subtly contradicts what the parent was holding. The parent goes off the rails confidently. The fix is treating sub-agent output like input from a stranger: read it, decide what to keep, throw the rest away.
Agent loops. Two agents handing work back and forth, each thinking the other is in charge. Always solvable with an explicit termination condition, never as obvious in design as it is in retrospect.
Cost runaway. Parallel exploration with three sub-agents at $X per call, called twenty times in a session, is a real bill. I now budget per workflow and surface the number to myself.
The orchestrator is wrong. The hardest one. The plan was good, the agents executed well, the result was the wrong thing because the orchestrator framed the problem badly. No amount of agent quality fixes a bad orchestrator.
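
Two of these, agent loops and cost runaway, have the same mechanical fix: a hard limit that the workflow checks on every step. A minimal sketch, assuming a hypothetical `WorkflowBudget` class and limits I picked for the example; it is not tied to any orchestration library.

```typescript
// Hypothetical guard: one instance per workflow run.
class WorkflowBudget {
  private spentUsd = 0;
  private handoffs = 0;
  private readonly maxUsd: number;
  private readonly maxHandoffs: number;

  constructor(maxUsd: number, maxHandoffs: number) {
    this.maxUsd = maxUsd;
    this.maxHandoffs = maxHandoffs;
  }

  // Record the cost of one agent call; throws once the run is over budget.
  charge(costUsd: number): void {
    this.spentUsd += costUsd;
    if (this.spentUsd > this.maxUsd) {
      throw new Error(
        `budget exceeded: $${this.spentUsd.toFixed(2)} of $${this.maxUsd.toFixed(2)}`,
      );
    }
  }

  // Record one agent-to-agent handoff. This is the explicit termination
  // condition: two agents cannot pass work back and forth forever.
  handoff(): void {
    this.handoffs += 1;
    if (this.handoffs > this.maxHandoffs) {
      throw new Error(`handoff limit of ${this.maxHandoffs} exceeded`);
    }
  }

  // Running total, for surfacing the number at the end of the run.
  get spent(): number {
    return this.spentUsd;
  }
}
```

Usage is one `charge` per model call and one `handoff` per delegation; the orchestrator catches the error and stops the run instead of letting it spin.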

What I think is still missing

First-class agent observability. I want a timeline view: which agent ran, what it saw, what it returned, what the parent did with it, where the context delta came from. The tooling exists in pieces (LangSmith, Helicone, the AI SDK telemetry) but not in the shape I want.
Cheap intermediate evaluation. Right now I run an agent, see what came back, and judge it myself. I want a small, fast eval-as-you-go that catches "this output is plausibly broken" without me reading every line.
A real story for human takeover mid-workflow. The handoff pattern in the gstack /browse skill is the closest thing I've seen to "stop, let me drive, resume." Most workflows fall back to "fail, ask the user, start over."

Drafting status, not settled. The plan/review/execute split in the worked example is the one I'd defend in a design review today. The rest I'd ship as they stand, knowing the list will probably look different in three months.