When you asked Iris to delegate a question to a specialist agent, you'd wait 15 seconds in silence. The answer was fine when it arrived. The wait was not.
What was actually happening
Picture a translator sitting between two people. Person A speaks. The translator listens to the full message, writes it down, translates it word by word, then reads the translation aloud to Person B. Even if Person A only spoke for 5 seconds, Person B waits for the full write-it-down-then-translate cycle.
That's what the pipeline was doing - a store-and-forward pattern where each stage had to fully complete before the next could start. The parent agent would call a specialist, wait for the specialist to finish its entire response, receive it as a block of text, internally process it (the model literally re-reads everything and reasons about it - a second full inference pass), then write its own version to the user. Two complete AI thinking cycles, back to back, one blocking the other.
Roughly:
- Specialist thinking and generating: ~6-8 seconds
- Parent re-reading that output and generating its own version: ~5-7 seconds
- Network and queue overhead: ~1-2 seconds
That second AI pass was entirely wasted work. The parent was spending its entire thinking budget just to rephrase what the specialist already said.
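To make the shape of the problem concrete, here's a rough sketch of that blocking flow. Every name in it (delegateToSpecialist, parentModel, sendToUser) is a hypothetical placeholder, not the actual pipeline code:

```typescript
// Hypothetical placeholders standing in for the real pipeline pieces.
declare function delegateToSpecialist(question: string): Promise<string>;
declare const parentModel: { generate(prompt: string): Promise<string> };
declare function sendToUser(text: string): void;

// Sketch of the old store-and-forward flow: two full inference passes,
// back to back, and the user sees nothing until both are done.
async function handleDelegation(question: string): Promise<void> {
  // Stage 1: block until the specialist has produced its entire answer (~6-8s).
  const specialistAnswer = await delegateToSpecialist(question);

  // Stage 2: the parent re-reads that answer and generates its own
  // version of it - a second full inference pass (~5-7s).
  const parentAnswer = await parentModel.generate(
    `Relay this to the user:\n${specialistAnswer}`
  );

  // Only now does the user see anything.
  sendToUser(parentAnswer);
}
```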
The fix: cut out the middleman
Instead of making the parent sit between the specialist and the user like a translator, we switched to something more like a conference call with live captioning. The specialist's output now streams directly to your browser, token by token, as it generates. You start reading immediately. No buffering, no waiting for the full response, no parent reprocessing.
In engineering terms this is called stream passthrough - the data flows straight through instead of being stored at an intermediate point first. It's the same idea as a CDN streaming a video to you while it pulls it from the origin server, rather than downloading the whole file and then serving it.
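A rough sketch of what passthrough looks like in code, with hypothetical names (specialistStream and pushToClient stand in for the real plumbing):

```typescript
// Hypothetical placeholders for the real streaming plumbing.
declare function specialistStream(question: string): AsyncIterable<string>;
declare function pushToClient(token: string): void;

// Passthrough sketch: tokens flow from the specialist straight to the
// client as they arrive, never buffered into a complete response first.
async function streamDelegation(question: string): Promise<string> {
  let fullText = "";
  for await (const token of specialistStream(question)) {
    pushToClient(token); // the user starts reading at the first token
    fullText += token;   // keep a copy for the coordination path
  }
  return fullText;       // later handed to the parent as a tool result
}
```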
The parent still knows what happened. It gets a message tagged as "already delivered to the user" - a kind of out-of-band signal that separates the delivery path (specialist to user) from the coordination path (tool result to parent). The parent sees this, understands the user already has the content, and moves on.
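Sketched loosely, the tool result the parent gets back might look something like this - the field names are illustrative, not the actual schema:

```typescript
// Illustrative shape of the out-of-band tool result; not the real schema.
interface DelegationToolResult {
  agent: string;            // which specialist answered
  content: string;          // the full text, so the parent has context
  deliveredToUser: boolean; // out-of-band flag: "the user already has this"
}

const result: DelegationToolResult = {
  agent: "research-specialist",
  content: "...",           // full specialist output
  deliveredToUser: true,    // tells the parent to skip the rewrite pass
};
```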
The numbers
Before: 15 seconds of silence, then the full response appears at once.
After: first words appear in about 3-4 seconds (the specialist's thinking startup time, called time-to-first-token or TTFT), then text streams in continuously. The parent's brief confirmation adds maybe half a second at the end.
The total content is the same. The experience is completely different. You're reading within 4 seconds instead of staring at nothing for 15. It's the same principle behind progressive rendering in web design - show something useful as early as possible, even if the full page isn't ready yet.
For parallel delegation (asking multiple specialists at once, a fan-out pattern), each specialist's output is pushed through as soon as that specialist finishes. You see results arriving one after another instead of waiting for the slowest one. The system checks for completed agents every half second and forwards them immediately, creating a natural staggered delivery.
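A minimal sketch of that fan-out loop, assuming each delegation is just a promise and using placeholder names:

```typescript
// Hypothetical placeholder for delivering a finished answer to the client.
declare function pushToClient(agent: string, text: string): void;

// Fan-out sketch: poll the in-flight delegations every 500ms and push
// each specialist's output through as soon as it completes, instead of
// waiting for the slowest one. (Error handling omitted for brevity.)
async function fanOut(tasks: Map<string, Promise<string>>): Promise<void> {
  const pending = new Map(tasks);
  const done = new Map<string, string>();

  // Record each result in the shared "done" map as it settles.
  for (const [agent, task] of pending) {
    task.then((text) => done.set(agent, text));
  }

  while (pending.size > 0) {
    await new Promise((resolve) => setTimeout(resolve, 500)); // half-second check
    for (const [agent, text] of done) {
      if (pending.has(agent)) {
        pushToClient(agent, text); // staggered delivery, fastest first
        pending.delete(agent);
      }
    }
  }
}
```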
What Claude did here
I described the bottleneck: "sub-agents are too slow, users wait 15 seconds." Claude traced through the delegation pipeline, identified the two-stage blocking pattern, and we designed the passthrough together. The implementation involved threading a feature flag through the tool chain so the system knows when to stream directly versus buffer. Claude wrote the streaming logic and the structured markers that tell the frontend which agent is speaking, while I verified the markers parsed correctly in the React chat component.
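For a sense of what "structured markers" means here, a loose illustration - the actual format isn't shown, and everything below is hypothetical:

```typescript
// Illustrative event shapes for attributing streamed text to an agent;
// not the real wire format.
interface AgentMarker {
  type: "agent_start" | "agent_end";
  agent: string;
}

type StreamEvent =
  | { kind: "marker"; marker: AgentMarker }
  | { kind: "token"; text: string };

// Group streamed tokens under whichever agent is currently "speaking".
function attributeTokens(events: StreamEvent[]): Map<string, string> {
  const byAgent = new Map<string, string>();
  let current = "parent";
  for (const ev of events) {
    if (ev.kind === "marker") {
      current = ev.marker.type === "agent_start" ? ev.marker.agent : "parent";
    } else {
      byAgent.set(current, (byAgent.get(current) ?? "") + ev.text);
    }
  }
  return byAgent;
}
```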