
Daniel Romitelli

Originally published at craftedbydaniel.com

My Voice Router That Refuses to Think: Pattern‑First Multi‑Agent Orchestration for Sub‑Second Latency

I built my voice assistant orchestration layer around a stubborn rule:

Don't spend an LLM call on a problem that a regex can solve.

That reads like common sense until you ship a voice-first UX. People tolerate a spinner on a website; they do not tolerate dead air in their ear. In voice, a two-second gap doesn't feel like "loading"---it feels like the system didn't hear you.

So I stopped treating routing as an AI problem and started treating it as a latency budget problem.

At the front of the stack I put a RouterAgent whose only job is to make a routing decision fast---usually without a model call---by doing deterministic pattern matching and keyword detection. Only when the query is genuinely ambiguous do I let the router fall back to an LLM classifier.

From there, an orchestrator fans out into specialized agents (location normalization, search/retrieval, Microsoft 365 operations, CRM operations), and then stitches the results back into a voice-optimized response.

The system behaves less like a committee meeting and more like a good emergency dispatcher: triage first, specialists second, narration last.

(One analogy, once: the RouterAgent is my 911 operator. It doesn't perform surgery---it decides whether to send an ambulance, a fire truck, or both, and it does that quickly.)


What went wrong first (the incident that forced the redesign)

My first version of the voice orchestrator did what most "smart" stacks do by default:

  1. Send the raw transcript to an LLM to classify intent.
  2. Based on the intent, call the relevant tool/agent.
  3. Summarize the output for voice.
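Sketched as code, that pipeline is just three awaits in a row. The names here (`llm_classify_intent`, `run_tool`) are hypothetical stand-ins, and the sleeps are placeholders for real network latency:

```python
import asyncio

# Hypothetical stand-ins for the real model and tool layer -- illustrative only
async def llm_classify_intent(transcript: str) -> str:
    await asyncio.sleep(0.05)  # placeholder for a 600-1200ms classification call
    return "location_search"

async def run_tool(intent: str, transcript: str) -> str:
    await asyncio.sleep(0.01)  # the retrieval itself is often the cheap part
    return "42 open tickets"

async def model_first_turn(transcript: str) -> str:
    intent = await llm_classify_intent(transcript)  # 1. pay the model, every turn
    result = await run_tool(intent, transcript)     # 2. do the work
    return f"I found {result}."                     # 3. summarize for voice
```

Notice that step 1 sits on the critical path of every single turn, even when the intent is obvious.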

That sounds clean on paper. In production, it created a failure mode that was both obvious and embarrassing:

The symptom

We started seeing voice sessions where the assistant would pause long enough that users repeated themselves, which caused a cascade:

  • Duplicate tool calls (same query twice)
  • Conflicting state ("No, I meant the first one...")
  • The second run often landing on a different interpretation because the context had changed

The trigger query

The query that made this undeniable was the most boring one imaginable:

"How many open tickets in the Northeast?"

That's not an "AI" query. It's a search count with a location constraint.

But in the model-first router, that query still paid the full classification tax before anything useful happened.

The numbers (before)

I pulled a sample of traces and bucketed the latency into three segments:

  • T_router: time spent deciding what to do
  • T_tools: time spent doing the work (search, M365, etc.)
  • T_voice: formatting + response packing for the voice layer

On the model-first design, for "obvious intent" queries like the one above, T_router dominated.

In a benchmark run against a representative sample of voice traffic, the median transcript length was 7--12 words and the "obvious" bucket (location + search/count keywords) made up the majority of voice queries.

For that bucket:

  • p50 end-to-end (router + tool + formatting): ~1.4s
  • p95 end-to-end: ~2.7s
  • The router/classification step alone was frequently 600--1200ms, and spiky under load.

The user impact was clear in the session logs: when silence exceeded ~1.8s, the "repeat rate" (user re-asks within 4 seconds) jumped sharply. Once repeats start, everything downstream gets noisier.
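The repeat-rate metric is straightforward to compute from timestamped turn logs. A simplified sketch (real logs need fuzzier transcript matching than exact equality, and the field names here are invented):

```python
from dataclasses import dataclass

@dataclass
class Turn:
    session_id: str
    ts: float          # seconds since session start
    transcript: str
    silence_ms: float  # gap between end of user speech and first audio out

def repeat_rate(turns: list[Turn], silence_threshold_ms: float, window_s: float = 4.0) -> float:
    """Fraction of slow turns (silence over threshold) the user re-asked within window_s."""
    by_session: dict[str, list[Turn]] = {}
    for t in turns:
        by_session.setdefault(t.session_id, []).append(t)

    slow = [t for t in turns if t.silence_ms > silence_threshold_ms]
    if not slow:
        return 0.0

    repeats = sum(
        1
        for t in slow
        if any(
            0 < u.ts - t.ts <= window_s and u.transcript == t.transcript
            for u in by_session[t.session_id]
        )
    )
    return repeats / len(slow)
```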

The wrong assumption

My wrong assumption was subtle:

"Classification is cheap compared to the real work."

That's often true in text chat because the "real work" might be multi-step tool calls. In voice, the "real work" for a big portion of queries is a single fast retrieval---meaning classification becomes the biggest line item.

The fix

I inverted the cost structure:

  • Route deterministically in the common case
  • Only call a model when the router can't decide

And I made the router fast enough that it can run on every turn without being part of the problem.

The numbers (after)

After the pattern-first router shipped, the same "obvious intent" bucket looked like this:

  • p50 end-to-end: ~620ms
  • p95 end-to-end: ~1.1s
  • T_router p95: < 8ms for deterministic routes
  • LLM fallback triggered on a small minority of queries, and those remained slower---which is fine, because that's exactly where I want to spend model latency.

That was the turning point: the system stopped feeling like it was "thinking" about everything and started feeling responsive.


Key insight: routing is a latency budget problem

In my codebase, the RouterAgent is explicitly described as:

  • "Lightweight intent classifier that routes to specialized agents."
  • Designed to be fast to minimize voice latency.
  • Uses pattern matching and keyword detection, with LLM fallback for ambiguous queries.

The routing rules are intentionally boring:

  • Location keywords (state names, cities, "in the Northeast") -> LocationAgent + SearchAgent
  • Search keywords (tickets, orders, inventory) -> SearchAgent
  • Email/calendar keywords -> M365Agent
  • CRM keywords -> CrmAgent
  • Ambiguous -> LLM classification

The naive design is "send everything to the model, ask what the user meant, then route." It's conceptually clean---and operationally expensive.

The design that works in voice is:

  • deterministic routing for the 80%
  • controlled escape hatch for the remaining ambiguity

Architecture (as built)

flowchart TD
  voiceRoute[Voice route] --> voiceProcessor[VoiceAgentProcessor]
  voiceProcessor --> orchestrator[AgentOrchestrator]
  orchestrator --> router[RouterAgent]
  router --> locationAgent[LocationAgent]
  router --> searchAgent[SearchAgent]
  router --> m365Agent[M365Agent]
  router --> crmAgent[CrmAgent]
  locationAgent --> searchAgent
  searchAgent --> response[Voice-optimized response]
  m365Agent --> response
  crmAgent --> response

The diagram is simple, but the important constraint is where complexity is allowed to live:

  • The router stays cheap.
  • The orchestrator coordinates.
  • Specialized agents do the expensive work.

The RouterAgent: deterministic first, model only when cornered

RouterAgent prioritizes predictable routing over "understanding." That's not an ideology; it's how you hit latency targets.

What the latency numbers mean (and how I measured them)

When I say "fast," I'm not quoting a vibe. I'm talking about a specific measurement:

  • Metric: routing decision latency (start of router -> intent decision produced)
  • Environment: AWS Graviton instance (4 vCPU), Python 3.11, warm process, no network calls on deterministic paths
  • Workload: 10,000 synthetic transcripts modeled on real traffic patterns (median 9 words, 90th percentile 18 words)
  • Reporting: p50 / p95 / p99 using a simple histogram recorder
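The recorder itself is nothing special. Here's a sorted-list sketch that reports nearest-rank percentiles; the production version buckets samples into a histogram for constant memory, but reports the same numbers within bucket resolution:

```python
class LatencyRecorder:
    """Collects latency samples and reports exact nearest-rank percentiles."""

    def __init__(self) -> None:
        self.samples_ms: list[float] = []

    def record(self, ms: float) -> None:
        self.samples_ms.append(ms)

    def percentile(self, p: float) -> float:
        s = sorted(self.samples_ms)
        # Nearest-rank: smallest sample with at least p% of samples at or below it
        k = max(0, min(len(s) - 1, round(p / 100 * len(s)) - 1))
        return s[k]
```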

On deterministic routes, the router is basically:

  • lowercase
  • regex + keyword checks
  • lightweight scoring

That's why the p95 stays in single-digit milliseconds.

When the LLM fallback triggers, routing latency obviously increases because it becomes "network + model." That's fine; the fallback is not the critical path for the majority of turns.

Why patterns don't collapse into a brittle rules engine

Pattern routing fails when you do either of these:

  1. You pretend patterns can cover everything.
  2. You keep piling on rules without a governance mechanism.

I avoided both by:

  • Keeping the deterministic rules intentionally coarse (high precision, decent recall)
  • Treating the fallback as a normal path for ambiguous queries
  • Logging "no-match" and "multi-match" cases so I can tighten patterns when it's worth it

A minimal, runnable RouterAgent implementation

import re
import time
from dataclasses import dataclass, field
from enum import Enum
from typing import Optional


class Intent(str, Enum):
    SEARCH = "search"
    LOCATION_SEARCH = "location_search"
    M365 = "m365"
    CRM = "crm"
    AMBIGUOUS = "ambiguous"


@dataclass
class RoutingDecision:
    intent: Intent
    agents: list[str]
    confidence: float
    reason: str
    used_fallback: bool = False
    latency_ms: float = 0.0


# --- Pattern definitions (coarse on purpose) ---

LOCATION_PATTERNS = [
    # Coarse on purpose: any "in/from/near/around <word>" counts as a location hint
    re.compile(r"\b(?:in|from|near|around)\s+[a-z]+", re.IGNORECASE),
    re.compile(
        r"\b(?:northeast|southeast|midwest|west coast|east coast"
        r"|california|texas|new york|florida|illinois)\b",
        re.IGNORECASE,
    ),
]

SEARCH_KEYWORDS = re.compile(
    r"\b(?:how many|find|search|list|show|count|look up"
    r"|tickets|orders|inventory|items|records)\b",
    re.IGNORECASE,
)

M365_KEYWORDS = re.compile(
    r"\b(?:email|calendar|schedule|meeting|outlook|teams"
    r"|send mail|book a meeting|invite)\b",
    re.IGNORECASE,
)

CRM_KEYWORDS = re.compile(
    r"\b(?:crm|call log|deal|pipeline|contact list"
    r"|sales call|account notes)\b",
    re.IGNORECASE,
)


class RouterAgent:
    """Deterministic-first intent router with LLM fallback."""

    def __init__(self, llm_client=None, llm_timeout: float = 2.0):
        self.llm_client = llm_client
        self.llm_timeout = llm_timeout

    async def route(self, transcript: str) -> RoutingDecision:
        start = time.perf_counter()
        text = transcript.strip().lower()

        # --- deterministic pass ---
        has_location = any(p.search(text) for p in LOCATION_PATTERNS)
        has_search = bool(SEARCH_KEYWORDS.search(text))
        has_m365 = bool(M365_KEYWORDS.search(text))
        has_crm = bool(CRM_KEYWORDS.search(text))

        decision: Optional[RoutingDecision] = None

        if has_location and has_search:
            decision = RoutingDecision(
                intent=Intent.LOCATION_SEARCH,
                agents=["LocationAgent", "SearchAgent"],
                confidence=0.95,
                reason="location + search keywords detected",
            )
        elif has_search and not has_m365 and not has_crm:
            decision = RoutingDecision(
                intent=Intent.SEARCH,
                agents=["SearchAgent"],
                confidence=0.90,
                reason="search keywords detected",
            )
        elif has_m365 and not has_crm:
            decision = RoutingDecision(
                intent=Intent.M365,
                agents=["M365Agent"],
                confidence=0.90,
                reason="M365 keywords detected",
            )
        elif has_crm and not has_m365:
            decision = RoutingDecision(
                intent=Intent.CRM,
                agents=["CrmAgent"],
                confidence=0.90,
                reason="CRM keywords detected",
            )

        if decision is not None:
            decision.latency_ms = (time.perf_counter() - start) * 1000
            return decision

        # --- LLM fallback for genuinely ambiguous queries ---
        decision = await self._llm_classify(transcript)
        decision.latency_ms = (time.perf_counter() - start) * 1000
        return decision

    async def _llm_classify(self, transcript: str) -> RoutingDecision:
        """Call the LLM classifier. Returns a decision with used_fallback=True."""
        if self.llm_client is None:
            return RoutingDecision(
                intent=Intent.AMBIGUOUS,
                agents=["SearchAgent"],  # safe default
                confidence=0.30,
                reason="no LLM client configured; defaulting to search",
                used_fallback=True,
            )

        import asyncio

        try:
            resp = await asyncio.wait_for(
                self.llm_client.classify(transcript),
                timeout=self.llm_timeout,
            )
            return RoutingDecision(
                intent=Intent(resp["intent"]),
                agents=resp["agents"],
                confidence=resp.get("confidence", 0.70),
                reason=f"LLM classification: {resp.get('reason', 'n/a')}",
                used_fallback=True,
            )
        except Exception as exc:  # asyncio.TimeoutError is an Exception subclass here
            return RoutingDecision(
                intent=Intent.AMBIGUOUS,
                agents=["SearchAgent"],
                confidence=0.20,
                reason=f"LLM fallback failed ({type(exc).__name__}); defaulting",
                used_fallback=True,
            )

This snippet does three things I care about in production:

  • makes deterministic decisions without network calls
  • returns a structured decision (intent, agents, confidence, reason)
  • clearly marks whether the fallback was used

That last point matters when you're tuning spend and latency: you can't control what you don't measure.


The orchestrator: coordination, not intelligence

The orchestrator coordinates multiple agents; it is responsible for sequencing and merging---nothing more. In my implementation, that means:

  • ask the router for a decision
  • run one agent or a short chain
  • normalize output into a voice-friendly response object
  • enforce time budgets and failure behavior

A common anti-pattern is letting the orchestrator "think" (LLM plan generation, free-form tool selection) and then calling the tools. For voice, that usually turns into extra steps, extra variance, and harder debugging.

The orchestration pattern that shows up constantly: normalize -> retrieve

The most common multi-agent chain in my system is:

  1. LocationAgent extracts and normalizes a constraint (state/city/region)
  2. SearchAgent executes the search using the normalized constraint

This is not philosophical; it's practical.

  • Location extraction is messy (synonyms, partials, ambiguous place names).
  • Search should receive a clean constraint object, not raw transcript text.

When you keep these separate, each piece stays testable.
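To make that contract concrete, here are hypothetical stubs for both agents, shaped to the `async run(transcript, context)` interface the orchestrator expects. The region map and ticket data are invented for illustration:

```python
import asyncio
import re

class LocationAgent:
    """Extracts and normalizes a location constraint (hypothetical region map)."""

    REGIONS = {"northeast": ["NY", "NJ", "CT", "MA"], "midwest": ["IL", "OH", "MI"]}

    async def run(self, transcript: str, context: dict) -> dict:
        m = re.search(r"\b(northeast|midwest)\b", transcript, re.IGNORECASE)
        if m is None:
            return {}
        region = m.group(1).lower()
        # The next agent gets a clean constraint object, not raw transcript text
        return {"region": region, "states": self.REGIONS[region]}

class SearchAgent:
    """Counts tickets, honoring a LocationAgent constraint if one is in context."""

    TICKETS = [{"state": "NY"}, {"state": "MA"}, {"state": "IL"}]  # invented data

    async def run(self, transcript: str, context: dict) -> dict:
        constraint = context.get("LocationAgent") or {}
        states = constraint.get("states")
        hits = [t for t in self.TICKETS if not states or t["state"] in states]
        where = f" in the {constraint['region']}" if states else ""
        return {"count": len(hits), "summary": f"There are {len(hits)} open tickets{where}."}
```

Each stub is trivially testable on its own: feed LocationAgent a transcript and check the constraint object; feed SearchAgent a constraint and check the count.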

A minimal, runnable orchestrator

import asyncio
import time
import logging
from dataclasses import dataclass
from typing import Any, Optional

logger = logging.getLogger(__name__)


@dataclass
class AgentResult:
    agent: str
    success: bool
    data: Any = None
    error: Optional[str] = None
    latency_ms: float = 0.0


@dataclass
class OrchestrationResult:
    results: list[AgentResult]
    text: str  # voice-ready summary
    total_ms: float = 0.0


class AgentOrchestrator:
    """Coordinates agents based on RouterAgent decisions.

    Runs chains (e.g. LocationAgent -> SearchAgent) sequentially
    and independent agents concurrently.  Enforces per-agent timeouts.
    """

    def __init__(
        self,
        agents: dict,  # name -> agent instance (each has async .run(query, context))
        router: "RouterAgent",
        agent_timeout: float = 3.0,
        total_timeout: float = 5.0,
    ):
        self.agents = agents
        self.router = router
        self.agent_timeout = agent_timeout
        self.total_timeout = total_timeout

    async def handle(self, transcript: str, context: dict | None = None) -> OrchestrationResult:
        wall_start = time.perf_counter()
        context = context or {}

        # 1. Route
        decision = await self.router.route(transcript)
        logger.info(
            "routed intent=%s agents=%s fallback=%s latency=%.1fms",
            decision.intent,
            decision.agents,
            decision.used_fallback,
            decision.latency_ms,
        )

        # 2. Execute agent chain
        agent_results: list[AgentResult] = []
        chain_context = {**context, "transcript": transcript}

        for agent_name in decision.agents:
            agent = self.agents.get(agent_name)
            if agent is None:
                agent_results.append(
                    AgentResult(agent=agent_name, success=False, error="agent not registered")
                )
                continue

            result = await self._run_agent(agent_name, agent, transcript, chain_context)
            agent_results.append(result)

            # Feed successful output forward so the next agent in the chain can use it
            if result.success and result.data is not None:
                chain_context[agent_name] = result.data

        # 3. Build voice-ready text
        text = self._build_response_text(agent_results, decision)

        total_ms = (time.perf_counter() - wall_start) * 1000
        return OrchestrationResult(results=agent_results, text=text, total_ms=total_ms)

    async def _run_agent(
        self, name: str, agent: Any, transcript: str, context: dict
    ) -> AgentResult:
        start = time.perf_counter()
        try:
            data = await asyncio.wait_for(
                agent.run(transcript, context),
                timeout=self.agent_timeout,
            )
            return AgentResult(
                agent=name,
                success=True,
                data=data,
                latency_ms=(time.perf_counter() - start) * 1000,
            )
        except asyncio.TimeoutError:
            logger.warning("%s timed out after %.1fs", name, self.agent_timeout)
            return AgentResult(
                agent=name,
                success=False,
                error="timeout",
                latency_ms=(time.perf_counter() - start) * 1000,
            )
        except Exception as exc:
            logger.exception("%s failed: %s", name, exc)
            return AgentResult(
                agent=name,
                success=False,
                error=str(exc),
                latency_ms=(time.perf_counter() - start) * 1000,
            )

    @staticmethod
    def _build_response_text(results: list[AgentResult], decision) -> str:
        """Merge agent outputs into a single voice-friendly string."""
        parts: list[str] = []
        for r in results:
            if r.success and r.data:
                # Each agent is expected to return a dict with a "summary" key
                summary = r.data.get("summary") if isinstance(r.data, dict) else str(r.data)
                if summary:
                    parts.append(summary)
            elif not r.success:
                parts.append(f"I wasn't able to complete the {r.agent} step.")

        return " ".join(parts) if parts else "I'm sorry, I couldn't find an answer for that."

In my real codebase, the orchestrator also enforces:

  • timeouts per agent
  • cancellation (if the voice turn times out, stop wasting tool calls)
  • structured error mapping (so voice responses fail gracefully)

Those are not cosmetic details. They're how you stop a slow dependency from turning into a user-facing stall.
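Cancellation in particular comes almost for free in asyncio: `wait_for` cancels the awaited task when the budget expires, which gives each agent a cleanup hook via `CancelledError`. A minimal sketch:

```python
import asyncio

async def slow_tool_call() -> str:
    """Stands in for an expensive downstream call (search, M365, ...)."""
    try:
        await asyncio.sleep(10)
        return "result"
    except asyncio.CancelledError:
        # Cleanup hook: close connections, abort the downstream request
        raise

async def voice_turn_with_budget() -> bool:
    task = asyncio.create_task(slow_tool_call())
    try:
        # wait_for cancels `task` when the budget expires, then raises TimeoutError
        await asyncio.wait_for(task, timeout=0.05)
    except asyncio.TimeoutError:
        pass
    return task.cancelled()
```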


VoiceAgentProcessor: the integration layer that makes voice feel stable

VoiceAgentProcessor is the bridge layer between the voice stack and the orchestrator, so voice routes don't have to know anything about routing rules, agent selection, or tool wiring.

This layer exists because voice has non-negotiable requirements:

  • A stable interface (handle_voice_query) regardless of how orchestration evolves
  • Consistent formatting (short sentences, low ambiguity, no huge JSON dumps)
  • Defensive handling for partial failures

Here's a complete, runnable skeleton of that bridge layer.

import asyncio
import logging
import hashlib
from dataclasses import dataclass
from typing import Optional

logger = logging.getLogger(__name__)

MAX_VOICE_CHARS = 300  # TTS engines get sluggish with long strings
VOICE_TIMEOUT = 4.5    # seconds — hard ceiling for a single voice turn


@dataclass
class VoiceResponse:
    text: str
    success: bool
    latency_ms: float = 0.0
    truncated: bool = False


class VoiceAgentProcessor:
    """Bridge between the voice transport layer and the agent orchestrator.

    Callers use `handle_voice_query` and get back a VoiceResponse.
    Everything about routing, agent selection, and timeout enforcement
    is hidden behind this interface.
    """

    def __init__(self, orchestrator: "AgentOrchestrator", timeout: float = VOICE_TIMEOUT):
        self.orchestrator = orchestrator
        self.timeout = timeout

    async def handle_voice_query(
        self,
        transcript: str,
        user_id: Optional[str] = None,
        session_id: Optional[str] = None,
    ) -> VoiceResponse:
        """Main entry point for voice routes.

        Args:
            transcript: Raw speech-to-text output.
            user_id: Opaque, redacted user identifier (never an email).
            session_id: Voice session ID for log correlation.
        """
        import time

        start = time.perf_counter()

        context = {
            "user_id": self._redact_id(user_id) if user_id else "anon",
            "session_id": session_id or "unknown",
            "channel": "voice",
        }

        try:
            result = await asyncio.wait_for(
                self.orchestrator.handle(transcript, context),
                timeout=self.timeout,
            )
            text = self._format_for_voice(result.text)
            latency = (time.perf_counter() - start) * 1000

            return VoiceResponse(
                text=text,
                success=True,
                latency_ms=latency,
                truncated=len(result.text) > MAX_VOICE_CHARS,
            )

        except asyncio.TimeoutError:
            latency = (time.perf_counter() - start) * 1000
            logger.warning(
                "voice turn timed out after %.0fms session=%s",
                latency,
                session_id,
            )
            return VoiceResponse(
                text="I'm still working on that. Let me get back to you in a moment.",
                success=False,
                latency_ms=latency,
            )

        except Exception as exc:
            latency = (time.perf_counter() - start) * 1000
            logger.exception("voice query failed session=%s: %s", session_id, exc)
            return VoiceResponse(
                text="Something went wrong. Could you try asking again?",
                success=False,
                latency_ms=latency,
            )

    @staticmethod
    def _format_for_voice(text: str) -> str:
        """Trim and clean text for TTS output."""
        text = text.strip()
        if len(text) <= MAX_VOICE_CHARS:
            return text

        # Cut at the last sentence boundary within the limit
        truncated = text[:MAX_VOICE_CHARS]
        last_period = truncated.rfind(".")
        if last_period > MAX_VOICE_CHARS // 2:
            return truncated[: last_period + 1]
        return truncated.rstrip() + "..."

    @staticmethod
    def _redact_id(user_id: str) -> str:
        """One-way hash so logs never contain raw identifiers."""
        return hashlib.sha256(user_id.encode()).hexdigest()[:12]

A few details here are deliberate:

  • The request context carries a redacted, hashed identifier, never an email.
  • asyncio.wait_for enforces a real voice turn budget.
  • formatting trims long responses before they hit TTS.

That's the difference between "we have agents" and "this feels like a voice product."


Why model-first routing fails in voice

If you skip the deterministic router and always ask a model "what should I do?", you pay for:

  • a model call even when the intent is obvious
  • unpredictable latency spikes (network, load, queueing)
  • less determinism (harder to debug why a tool was selected)

And you also create a nasty operational blind spot: when users complain, you end up with hand-wavy explanations ("the model misrouted") instead of actionable ones ("pattern X didn't match because of phrasing Y").

My RouterAgent keeps the common cases boring.

Boring is good.


Nuances that matter once this is real

1) Deterministic routing needs observability, not just patterns

Patterns don't improve by hope. I log three routing outcomes:

  • match: which rule fired, how long it took
  • no-match: fell to LLM fallback (record the query shape, not PII)
  • multi-match: two rules could apply; record and resolve with precedence

That gives me a backlog of "worth adding a pattern" candidates.
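The bookkeeping for those three outcomes is deliberately tiny. A sketch of the counter (in production this feeds a metrics backend, and only the query shape is recorded, never the transcript):

```python
from collections import Counter

outcome_counts: Counter = Counter()

def classify_outcome(fired_rules: list[str]) -> str:
    """Bucket a routing decision by how many deterministic rules fired."""
    if len(fired_rules) == 1:
        return "match"
    if not fired_rules:
        return "no-match"      # fell through to the LLM fallback
    return "multi-match"       # precedence resolved it; candidate for tightening

def record_outcome(fired_rules: list[str], word_count: int) -> None:
    # Only the shape (rule names, word count) would be logged -- no PII
    outcome_counts[classify_outcome(fired_rules)] += 1
```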

2) Keep the router's job small

The fastest router is the one that doesn't do extra work. I explicitly avoid:

  • entity extraction beyond what's needed for routing
  • summarization
  • query rewriting

If the router starts doing "smart" stuff, it becomes the most-called agent and therefore the largest risk.

3) The fallback must be controlled

LLM fallback is a pressure valve, not a blank check.

I treat fallback like a real dependency:

  • it has a timeout
  • it has a default path on failure
  • it's measured separately

That's how you prevent a rare ambiguous query from degrading the entire voice experience.
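"Measured separately" just means splitting on the `used_fallback` flag the router already returns. A sketch of the aggregation (any objects with `used_fallback` and `latency_ms` attributes work):

```python
def split_latency_stats(decisions) -> dict:
    """Aggregate router latency separately for deterministic vs fallback turns."""
    det = [d.latency_ms for d in decisions if not d.used_fallback]
    fb = [d.latency_ms for d in decisions if d.used_fallback]

    def avg(xs: list) -> float:
        return sum(xs) / len(xs) if xs else 0.0

    total = len(det) + len(fb)
    return {
        "deterministic_avg_ms": avg(det),
        "fallback_avg_ms": avg(fb),
        "fallback_share": len(fb) / total if total else 0.0,
    }
```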


Closing

The big change wasn't adding more intelligence. It was moving intelligence out of the critical path.

Once I treated routing as a latency budget problem, the architecture snapped into something that actually works for voice: a deterministic RouterAgent that's fast enough to run every turn, an orchestrator that coordinates without improvising, and a VoiceAgentProcessor that enforces time budgets and response shape so the user hears something crisp instead of waiting through silence.
