Voice

Realtime Voice Barge-In - From Awkward Turns to Natural Conversation

March 7, 2026

This refactoring was not just about adding providers.

It was about removing ambiguity in voice mode.

I kept coming back to a simple question: "Why does voice still feel awkward even after latency work?"

The answer was not one bug. It was a messy set of interaction problems:

  • sometimes slow STT
  • ghost transcripts from native paths
  • no true interruption when Iris was already speaking

That is where experience helped most: make the fuzzy problem concrete first, then ship.

The hidden trap with voice features is this: you can improve metrics and still ship a bad feeling.

I had already done latency work. Some numbers got better, but the conversation still had friction because the interaction model itself was wrong for natural back-and-forth.

So this phase was less about "adding speed" and more about "removing awkward moments."

The ambiguity reduction checklist I used

Before writing code, I forced the problem into clear questions:

  1. What exact moment feels broken in real use?
  2. What is a symptom, and what is the actual constraint?
  3. What is the smallest change that improves the experience now?
  4. If this direction is wrong, how expensive is it to roll back?

That gave me a cleaner target: make voice feel interruptible and continuous, not perfect.

How I filtered options without getting lost

There were several possible directions. I needed a filter, not more ideas.

flowchart TD
    A["Voice still feels awkward"] --> B{"Can this fix the interruption moment?"}
    B -->|No| C["Defer for later"]
    B -->|Yes| D{"Can I ship it with current architecture?"}
    D -->|No| E["Too big right now"]
    D -->|Yes| F{"Safe rollback if wrong?"}
    F -->|No| G["High risk, avoid right now"]
    F -->|Yes| H["Ship in realtime mode behind feature flag"]

This helped me avoid overbuilding. I did not need the "ultimate voice stack" in one go. I needed the first version that made conversation feel natural.

Architecture snapshot: before vs now

Before (legacy, turn-based)

flowchart LR
    Mic["User mic audio"] --> STT["STT (Browser / Chutes / Eleven)"]
    STT --> LLM["LLM stream (SSE)"]
    LLM --> TTS["TTS (Chutes / Eleven)"]
    TTS --> Spk["Speaker output"]
    Spk -. "User waits for full turn, then speaks again" .-> Mic

Now (realtime mode with barge-in)

flowchart LR
    Mic["User mic audio"] --> VAD["VAD (always on in realtime mode)"]
    VAD --> STT["STT (Groq / Deepgram / Browser)"]
    STT --> LLM["LLM stream (SSE)"]
    LLM --> TTS["TTS (Deepgram / Chutes / Eleven)"]
    TTS --> Spk["Speaker output"]
    VAD -. "Speech detected while AI is speaking" .-> Cancel["Barge-in: cancel TTS queue + cancel SSE stream"]
    Cancel --> VAD
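
To make the mode switch concrete, here is a rough TypeScript sketch of how the realtime pipeline state could be modelled. The names (VoiceState, RealtimePipeline, onModeChange) are illustrative assumptions for this post, not the actual identifiers in the codebase.

    // Hypothetical sketch of the realtime voice pipeline state.
    type VoiceState = "idle" | "listening" | "transcribing" | "thinking" | "speaking";

    interface RealtimePipeline {
      state: VoiceState;
      // VAD stays on the whole time in realtime mode,
      // so speech can be detected even while TTS is playing.
      vadActive: boolean;
      // Handles needed to cancel work when the user barges in.
      activeTts: { cancel(): void } | null;
      activeStream: AbortController | null;
    }

    // Legacy mode only listens between turns; realtime mode never stops listening.
    function onModeChange(pipeline: RealtimePipeline, realtime: boolean): void {
      pipeline.vadActive = realtime;
      pipeline.state = realtime ? "listening" : "idle";
    }

The important difference is the always-on VAD plus the two cancel handles: they are what turn "wait for the full turn" into "interrupt any time".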

Analogy: from walkie-talkie to conversation

The old version behaved like a walkie-talkie: one person talks, then releases the button.

The new version behaves closer to real conversation: if you start talking, the other side stops and listens.

Not perfect full-duplex telephony, but a big UX jump.

Another analogy: it used to feel like waiting at a one-lane bridge with traffic lights. One side goes, then waits. Now it feels closer to a roundabout where flow can continue and adjust quickly.

How the plan was formed (and why prompting mattered)

This did not come from one-shot generation. I used Claude's AskUserQuestion loop for about 15 minutes and answered questions back-to-back.

I gave concrete context:

  • ElevenLabs docs
  • websocket option we were considering
  • our current SSE + HTTP stack
  • existing cancellation primitives
  • budget and rollout constraints

Prompt quality was the multiplier here. I pushed for tradeoff decisions, not a feature wishlist. Claude's AskUserQuestion loop is a powerful tool for getting to the heart of a problem. It researched Groq, Deepgram, ElevenLabs, and our current stack, and gave me a cost breakdown: Deepgram offers $200 of free credit, Groq has a free tier, and at my usage ElevenLabs would have been the most expensive at around £75 a month.

That turned a fuzzy ask into an executable plan with scope, order, and verification.

The most useful part was not getting answers. It was being forced to clarify assumptions:

  • where I was optimizing for feel vs raw latency
  • where complexity would create maintenance cost
  • where a "clean architecture" could still be the wrong product move today

That clarity is what made implementation fast afterward.

What actually shipped in the last commit

Highlights:

  • Added Groq STT and Deepgram STT/TTS gateways and providers
  • Added realtime mode flag and UI mode toggle
  • Implemented client-side barge-in in chat voice loop
  • Updated provider selector for new STT/TTS options
  • Tuned VAD behavior for faster turn handling
  • Added voice-focused feature tests for new providers and config
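
As an illustration of how the flag and provider choices above might hang together, here is a small TypeScript sketch. The realtimeVoice flag name and the provider IDs are assumptions for the example, not the project's actual config keys.

    // Hypothetical config sketch: realtime mode behind a flag, with ordered provider fallbacks.
    type SttProvider = "groq" | "deepgram" | "browser" | "chutes" | "eleven";
    type TtsProvider = "deepgram" | "chutes" | "eleven";

    interface VoiceConfig {
      realtimeVoice: boolean;      // feature flag: off => legacy turn-based mode
      stt: SttProvider[];          // ordered preference, first available wins
      tts: TtsProvider[];
    }

    const defaults: VoiceConfig = {
      realtimeVoice: false,
      stt: ["groq", "deepgram", "browser"],
      tts: ["deepgram", "chutes", "eleven"],
    };

    // Pick the first provider that is actually configured and available.
    function pickProvider<T extends string>(prefs: T[], available: Set<T>): T | null {
      return prefs.find((p) => available.has(p)) ?? null;
    }

Keeping the flag default off means the legacy turn-based path stays the fallback while realtime mode is rolled out.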

Decision table (pain -> choice -> impact)

  • Interruptions felt broken -> Client-side barge-in -> Fastest path to real UX gain with existing primitives
  • STT speed inconsistency -> Add Groq and Deepgram STT -> Gives fast options without replacing whole stack
  • Design uncertainty -> AskUserQuestion + docs-driven loop -> Reduced ambiguity before implementation
  • Risk of large rollout -> Feature-flagged realtime mode -> Safe rollout with legacy fallback

Key technique: client-side barge-in

When VAD detects speech while Iris is speaking:

  1. Stop queued and active TTS playback
  2. Cancel active SSE model stream
  3. Move state to listening immediately
  4. Continue normal capture/transcribe path

This is why the interruption feels instant.
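
Here is a minimal sketch of that handler in TypeScript, assuming an AbortController for the SSE stream and a small TTS queue wrapper. The names (BargeInContext, ttsQueue, onUserSpeechStart) are illustrative, not the actual code.

    // Hypothetical barge-in handler: called by VAD when speech starts
    // while the assistant is still speaking.
    interface BargeInContext {
      ttsQueue: { stop(): void };            // stops queued + currently playing audio
      sseController: AbortController | null;  // aborts the in-flight LLM stream
      setState(next: "listening"): void;
    }

    function onUserSpeechStart(ctx: BargeInContext): void {
      // 1. Stop queued and active TTS playback immediately.
      ctx.ttsQueue.stop();

      // 2. Cancel the active SSE model stream so nothing else gets spoken.
      ctx.sseController?.abort();
      ctx.sseController = null;

      // 3. Flip state to listening right away; the normal
      //    capture/transcribe path (step 4) continues from here.
      ctx.setState("listening");
    }

Because everything here is client-side and synchronous, the perceived interruption latency is just the VAD detection time.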

The interruption moment (step by step)

sequenceDiagram
    participant U as Me
    participant V as VAD
    participant T as TTS
    participant S as SSE Stream
    participant A as Assistant

    A->>T: Speaking response
    U->>V: Start talking mid-response
    V->>T: cancel()
    V->>S: cancel()
    V->>A: Set state to listening
    U->>V: Continue speaking
    V->>A: Route speech to STT path

This sequence is simple on purpose. The simpler this path is, the less likely it is to break in real use.

What changed for me

Before:

  • conversations felt turn-locked
  • interrupting was awkward
  • occasional ghost transcript behavior

After:

  • I can interject naturally
  • AI stops and listens when I start speaking
  • faster STT paths are available
  • A smile on my face

The biggest practical difference is conversational confidence. I no longer hesitate before speaking because I know interruption works.

That one behavioral shift matters more than most micro-optimizations.

Scope and decision

I intentionally scoped this release around interruption and conversational flow.

That was the highest-friction moment in real use, so that is where I spent engineering effort first.

The result is a meaningful UX jump with controlled complexity and low rollout risk.

Bottom line

The biggest win was not just new providers.

The biggest win was reducing ambiguity in the problem definition, then shipping the smallest architecture change that produced a noticeable UX jump.