Architecture

Intelligent Agent Routing - How Iris Picks the Right Agent in 40ms

February 12, 2026

When you type "build me a Laravel model with a factory and migration", you mean the Engineer. When you say "research AI trends in healthcare", you want the Researcher. Previously, Iris relied on either you picking the right agent from a dropdown or the default agent burning 10-30 seconds of LLM inference to figure out where to delegate.

Neither was good enough. Manual selection breaks flow. LLM-driven delegation is slow and expensive. So I built a semantic pre-routing layer that sits between your message and the agent, resolving the right specialist in about 40 milliseconds.

Before and after

Before routing, every conversation started the same way regardless of what you asked:

  1. Message hits the default MainAgent
  2. MainAgent's LLM reads the message, reasons about which specialist to call
  3. MainAgent invokes SendToSubAgent or ParallelSubAgents tool
  4. The specialist agent spins up, receives the delegated task, generates a response
  5. Total time before the user sees anything useful: 10-30 seconds

That delegation tax applied even when the answer was obvious. "Write me a Laravel Eloquent scope for active users" - clearly the Engineer - still burned a full LLM inference cycle just to arrive at that conclusion.

After routing:

  1. Message hits the routing layer (25-45ms)
  2. Routing embeds the query, scores all agents, picks Engineer with 0.467 confidence
  3. Engineer receives the message directly, starts generating immediately
  4. First token arrives in 3-4 seconds

The delegation step is gone. No LLM reasoning about "which agent should handle this." The semantic layer already knows. For unambiguous queries (which are the majority), this cuts 10-25 seconds of dead time on every new conversation.

The vector foundation

The entire routing system runs on cosine similarity - measuring the angle between two vectors in high-dimensional space. If two vectors point in roughly the same direction, the query and the agent are semantically related. If they point away from each other, they are not.

Iris already had cosine similarity buried inside EmbeddingRerankPipelineService, the memory retrieval pipeline that ranks which memories are relevant to a conversation. It was an inline method, tightly coupled to memory logic. Routing needed the same maths but in a completely different context - comparing query embeddings against agent profiles, not user memories.

So the first step was extraction. Cosine similarity moved into VectorMath, a standalone utility class with a single static method. The memory pipeline now delegates to it instead of computing its own. The routing scorer calls the same method. One implementation, two consumers, zero duplication.

The maths is straightforward: dot product of two vectors divided by the product of their magnitudes. But there are edge cases that matter in practice. Empty vectors (agent has no embedding yet) return 0.0 instead of dividing by zero. Mismatched dimensions (query embedded at 4096 dims, stale profile at 1024 dims) gracefully truncate to the shorter length rather than crashing. Zero-magnitude vectors (all zeros, which happens if an embedding API returns garbage) return 0.0 rather than NaN. These guard rails sound obvious but they saved real debugging time during the 0.6B-to-8B migration when old profiles were temporarily still in the database.
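
To make those guard rails concrete, here is a minimal sketch of the extracted utility. VectorMath and its single static method come straight from the description above; the method name cosineSimilarity is my guess at it, so treat the details as illustrative rather than the literal source.

```php
<?php

final class VectorMath
{
    /**
     * Cosine similarity: dot(a, b) / (|a| * |b|).
     *
     * @param float[] $a
     * @param float[] $b
     */
    public static function cosineSimilarity(array $a, array $b): float
    {
        // Empty vector (agent has no embedding yet): similarity is 0.0.
        if ($a === [] || $b === []) {
            return 0.0;
        }

        // Mismatched dimensions (e.g. a 4096-dim query against a stale
        // 1024-dim profile): truncate to the shorter length, don't crash.
        $length = min(count($a), count($b));

        $dot = 0.0;
        $magA = 0.0;
        $magB = 0.0;

        for ($i = 0; $i < $length; $i++) {
            $dot  += $a[$i] * $b[$i];
            $magA += $a[$i] * $a[$i];
            $magB += $b[$i] * $b[$i];
        }

        // Zero-magnitude vector (all zeros): return 0.0 rather than NaN.
        if ($magA == 0.0 || $magB == 0.0) {
            return 0.0;
        }

        return $dot / (sqrt($magA) * sqrt($magB));
    }
}
```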

Each cosine similarity computation across 4096 dimensions takes microseconds. Scoring 18 agents means 18 cosine calls - still under a millisecond total. The embedding API call dominates the timing budget, not the vector maths.

How it works

Every agent gets a capability profile - a text summary of its name, description, system prompt, and routing keywords, embedded into a 4096-dimensional vector using Qwen3-Embedding-8B. These profiles are pre-computed and cached.
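
A sketch of that pre-computation step, assuming Laravel's cache facade; the Agent fields, the EmbeddingClient, and the cache key are my assumptions, but the profile inputs match the description above.

```php
<?php

use Illuminate\Support\Facades\Cache;

// Sketch only: EmbeddingClient and the cache key are hypothetical.
function buildProfileEmbedding(Agent $agent, EmbeddingClient $embedder): array
{
    // Concatenate the agent's routable surface area into one text blob:
    // name, description, system prompt, and routing keywords.
    $profileText = implode("\n", [
        $agent->name,
        $agent->description,
        $agent->system_prompt,
        implode(', ', $agent->routing_keywords ?? []),
    ]);

    // Embed once (Qwen3-Embedding-8B, 4096 dims) and cache until the
    // agent's profile changes.
    return Cache::rememberForever(
        "agent-profile-embedding:{$agent->id}",
        fn (): array => $embedder->embed($profileText)
    );
}
```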

When a new conversation starts (and you haven't manually picked an agent), the routing service:

  1. Embeds your message into the same vector space
  2. Computes cosine similarity against every agent's profile
  3. Blends four signals into a final score:
    • Semantic similarity (60%) - how closely your query matches the agent's capabilities
    • Performance history (20%) - exponential moving average of past routing outcomes
    • Keyword matching (15%) - binary check against explicit routing keywords on each agent
    • Recency (5%) - slight preference for recently successful agents, decaying over 48 hours
  4. If the top score clears the confidence threshold (0.30), that agent handles it
  5. If not, the default agent takes over with its existing delegation tools

The whole process adds about 25-45ms to the first message. Subsequent messages in the same conversation skip routing entirely - once an agent is assigned, it stays.
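
In code, the decision reduces to a weighted sum plus a threshold check. A minimal sketch, using the weights and threshold above and assuming each signal is already normalised to 0-1:

```php
<?php

// Signal weights from the blend described above.
const W_SEMANTIC    = 0.60;
const W_PERFORMANCE = 0.20;
const W_KEYWORD     = 0.15;
const W_RECENCY     = 0.05;

const CONFIDENCE_THRESHOLD = 0.30;

// Blend the four routing signals (each 0..1) into one final score.
function blendScore(float $semantic, float $performance, float $keyword, float $recency): float
{
    return W_SEMANTIC * $semantic
         + W_PERFORMANCE * $performance
         + W_KEYWORD * $keyword
         + W_RECENCY * $recency;
}

// Route to the top-scoring agent only if it clears the threshold;
// otherwise return null and fall back to the default agent's tools.
function pickAgent(array $scoredAgents): ?string
{
    if ($scoredAgents === []) {
        return null;
    }

    arsort($scoredAgents); // highest final score first, keys preserved
    $top = array_key_first($scoredAgents);

    return $scoredAgents[$top] >= CONFIDENCE_THRESHOLD ? $top : null;
}
```

With the live numbers later in this post, pickAgent(['engineer' => 0.467, 'researcher' => 0.273]) returns 'engineer'; if every score sat below 0.30 it would return null and the default agent would take over with its delegation tools.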

The self-learning bit

Every routing decision gets recorded. A scheduled job runs every 10 minutes, looks at recent outcomes, and classifies them:

  • Positive: response was substantial, no errors, user didn't switch agents, follow-up messages came
  • Negative: user overrode the agent, errors occurred, response was unusably short
  • Neutral: everything else

These signals feed back into each agent's performance score via an exponential moving average (learning rate 0.1). Agents that consistently deliver good results for certain query types gradually score higher for those queries. Agents that get overridden score lower. The system gets better at routing over time without any manual tuning.
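
The update itself is one line of arithmetic. A sketch, where the 0.1 learning rate is the system's; the 0/0.5/1 reward mapping is my assumption, since only the three outcome classes are specified:

```php
<?php

const LEARNING_RATE = 0.1;

// Map a classified outcome to a reward. The 0/0.5/1 mapping is an
// assumption; only the outcome classes themselves are documented.
function rewardFor(string $outcome): float
{
    return match ($outcome) {
        'positive' => 1.0,
        'negative' => 0.0,
        default    => 0.5, // neutral
    };
}

// Exponential moving average: lean 10% toward the latest outcome,
// keep 90% of the accumulated history.
function updatePerformanceScore(float $current, string $outcome): float
{
    return $current + LEARNING_RATE * (rewardFor($outcome) - $current);
}
```

Under this assumed mapping, starting from the neutral 0.50 prior visible in the scoring tables below, a single positive outcome only nudges an agent to 0.55 - it takes a consistent run of good or bad results to move the needle, which is exactly the gradual behaviour the design wants.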

The embedding model matters more than you'd think

I started with Qwen3-Embedding-0.6B (1024 dimensions). It was fast and cheap, but the semantic scores were nearly flat across agents. A coding query would score the Engineer at 0.38 and the Content Writer at 0.35 - barely distinguishable. The model simply didn't have enough capacity to differentiate between nuanced capability descriptions.

Switching to Qwen3-Embedding-8B (4096 dimensions) was transformative. The same coding query now scores the Engineer at 0.62 and the Content Writer at 0.31. "Set a reminder for tomorrow" gives Automation Operator 0.74 and Engineer 0.28. The semantic signal became sharp enough to be the primary routing factor, exactly as designed.

What I learned about keywords

Keywords are a safety net, not the primary routing mechanism. But they need to be specific. Early on I had "write" as a keyword on the Content Writer. Every "write me a Laravel migration" message got caught by it. "Data" on the Researcher collided with "data processing" queries meant for the Engineer.

The fix was simple: make keywords domain-specific. The Content Writer gets "draft", "article", "blog post", "proofread". The Engineer gets "code", "debug", "refactor", "deploy". No overlapping generic terms. Keywords should disambiguate, not compete.
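
The signal itself stays trivially simple: a binary hit or miss. A sketch, assuming a case-insensitive substring match against the query:

```php
<?php

// Binary keyword signal: 1.0 if any of the agent's routing keywords
// appears in the query, 0.0 otherwise. The case-insensitive substring
// match is an assumption about the implementation.
function keywordSignal(string $query, array $keywords): float
{
    $haystack = mb_strtolower($query);

    foreach ($keywords as $keyword) {
        if (str_contains($haystack, mb_strtolower($keyword))) {
            return 1.0;
        }
    }

    return 0.0;
}
```

This is also why generic keywords were dangerous: a function like this matches "write" in "write me a Laravel migration" just as happily as in a blog post request.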

Why first-message-only

I briefly considered routing every message, not just the first. The idea was appealing - "research AI trends" mid-conversation with the Engineer could automatically hand off to the Researcher. But routing treats each message in isolation. It has no access to conversation history, tool results, or the context the current agent has built up. Routing a follow-up message like "now summarise that" would land on a random agent with zero context.

The right architecture is: route the first message semantically, then let the LLM's existing delegation tools handle mid-conversation shifts where the full context is available. The routing layer picks the starting point. The orchestration layer handles everything after.

Override detection

When you manually switch agents mid-conversation, the system notices. It marks the routing decision as overridden (negative signal), logs the switch, and the feedback loop adjusts. If users consistently override a particular routing pattern, the agent's score for that query type naturally decreases over time.
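
A sketch of what that override hook might look like - the model and column names here are illustrative, not the actual schema:

```php
<?php

// Illustrative sketch: when the user switches agents mid-conversation,
// flag the original routing decision so the feedback job classifies it
// as a negative outcome.
function recordOverride(Conversation $conversation, Agent $newAgent): void
{
    $decision = $conversation->routingDecision;

    if ($decision !== null && $decision->agent_id !== $newAgent->id) {
        $decision->update([
            'overridden'    => true,
            'overridden_to' => $newAgent->id,
            'overridden_at' => now(),
        ]);
    }
}
```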

This means I never have to manually tune routing weights. The system observes what actually works and adjusts itself. Bad routing decisions are self-correcting.

The numbers

Operation                               Time
Load cached profiles                    < 1ms
Embed query (API call)                  20-40ms
Score all agents (cosine + arithmetic)  < 1ms
Record decision (DB insert)             1-2ms
Total                                   25-45ms

For comparison, LLM-driven delegation via SendToSubAgent was 10-30 seconds. That is a 250-750x speedup for the common case where the right agent is obvious from the query.

In practice

Here is a real routing decision from the first day of live traffic. The query: "build me a Laravel model with a factory and migration."

Agent                Semantic  Keyword  Performance  Recency  Final Score
Engineer             0.362     1.0      0.50         0.0      0.467
Researcher           0.289     0.0      0.50         0.0      0.273
Content Writer       0.241     0.0      0.50         0.0      0.245
Automation Operator  0.198     0.0      0.50         0.0      0.219

Engineer wins. Semantic similarity is moderate (0.362), but the keyword hit reinforces the match, pushing the final score past the 0.30 threshold. The whole decision took under 45ms. The response came back with a working Eloquent model, factory, and migration - 2,499 characters, no errors, no override.
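
That 0.467 is easy to reproduce from the blend weights: 0.6 × 0.362 + 0.2 × 0.50 + 0.15 × 1.0 + 0.05 × 0.0 = 0.2172 + 0.10 + 0.15 + 0.0 ≈ 0.467. The losing rows follow the same arithmetic - the Researcher's 0.273 is just 0.6 × 0.289 + 0.2 × 0.50, with no keyword hit to lift it over the threshold.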

Within the same conversation, a follow-up changed scope entirely: "now add a LazyCollection CSV importer to it." Still clearly engineering work. Because routing only fires on the first message, the Engineer stayed assigned. No unnecessary agent switch, no lost context. The agent adapted within its own domain expertise, which is exactly what you want - the routing layer picks the starting point, the agent handles everything after.

After upgrading to Qwen3-Embedding-8B, the semantic scores sharpened dramatically. Here is the same query type with the 8B model:

Agent                Semantic (0.6B)  Semantic (8B)  Delta
Engineer             0.362            0.620          +0.258
Content Writer       0.350            0.310          -0.040
Researcher           0.289            0.280          -0.009
Automation Operator  0.198            0.280          +0.082

The 0.6B model scored Engineer and Content Writer within 0.012 of each other for a coding query. The 8B model puts a 0.310 gap between them. That is the difference between confident routing and a coin flip.

Early metrics from live traffic

After the first 11 routing decisions:

Metric                   Value
Total routings           11
Average confidence       0.469
Override rate            18.2% (2/11)
Average response length  1,387 chars
Error rate               0%

Per-agent breakdown:

Agent                Routings  Avg Confidence  Overrides  Avg Response
Engineer             6         0.460           1          1,777 chars
Researcher           2         0.565           1          1,114 chars
Automation Operator  2         0.327           0          882 chars
Content Writer       1         0.612           0          605 chars

Engineer dominates early traffic, which makes sense - most of my queries are technical. The two overrides both came during initial testing when keywords were still too broad ("write" on Content Writer, "data" on Researcher). After tightening keywords, subsequent routings have been clean.

Content Writer had the highest single-query confidence (0.612) because "draft me a blog post about..." is unambiguous - strong semantic match plus a keyword hit on "blog post". Automation Operator's lower confidence (0.327, barely above threshold) reflects that automation queries tend to be more ambiguous ("set up a workflow" could be several agents).

The cases that taught me the most were the ambiguous ones. "Help me plan a marketing campaign" - is that the Content Writer (drafting copy) or the Researcher (market analysis) or the General Assistant (broad planning)? With the 8B model, the Researcher wins on semantic similarity because "plan" and "campaign" align with research and strategy language in its capability profile. Whether that is the right call depends on what the user actually wants, and that is where the feedback loop earns its keep. If users consistently override that routing, the Researcher's score for marketing queries drops and the system self-corrects.

What's next

The routing data is accumulating. Once there's enough history, I want to build per-query-type performance dashboards - which agents handle which topics best, where confidence is consistently low (indicating a gap in agent coverage), and whether the keyword signal is still pulling its weight relative to pure semantic matching. The feedback loop is the foundation; the analytics are the payoff.