Every voice feature I have shipped until now has been three separate things bolted together: STT to turn speech into text, an LLM to think, and TTS to turn the answer back into audio.
That architecture works. But it has a fundamental ceiling.
Each stage has its own latency budget. Each stage has its own failure mode. And stitching three async pipelines together in the browser, with cancellation, queue management, and barge-in detection layered on top, is a lot of moving parts for something that should feel as natural as a phone call.
This release replaces all of that with a single persistent WebSocket.
What Deepgram Voice Agent actually is
Deepgram has a product called the Voice Agent API. Instead of exposing separate endpoints for STT, LLM, and TTS, it exposes one WebSocket connection that handles the whole pipeline internally.
You send raw PCM audio frames. The agent transcribes them, decides when the user has finished speaking, runs the LLM, synthesises speech, and sends PCM audio back. Barge-in is native: when Deepgram detects you speaking while it is playing audio, it fires a UserStartedSpeaking event and stops itself.
The "think" provider (the LLM) is pluggable. I pointed it at my existing model configuration via an OpenAI-compatible endpoint, so Iris's brain stays the same. Only the delivery mechanism changed.
Deepgram is also generous with new accounts: you get $200 in free credit to start, with no expiry and no credit card required. The Voice Agent API is billed per minute of WebSocket connection time. On the Pay As You Go tier, Standard (Deepgram's own STT, LLM, and TTS) runs $0.08/min (around $4.80/hour). If you bring your own LLM it drops to $0.07/min, and if you bring your own LLM and TTS it drops further to $0.05/min (around $3/hour). There is also a Growth tier with slightly lower rates across the board. For a personal assistant that is not running 24/7, the free credit alone covers a considerable amount of real usage.
Before vs now
Before this release, the voice pipeline looked like this:
flowchart LR
Mic["Mic audio"] --> STT["STT request (HTTP)"]
STT --> LLM["Chat endpoint (SSE stream)"]
LLM --> TTS["TTS request (HTTP, chunked)"]
TTS --> Spk["Speaker"]
Spk -. "Wait for full turn" .-> Mic
Each arrow is a separate HTTP call. Each one adds latency and a new place to fail. Chunked TTS helped mask the wait, but the underlying waterfall was still there.
Now:
flowchart LR
Mic["Mic audio (PCM16)"] --> WS["Single WebSocket"]
WS --> Spk["Speaker (PCM16)"]
WS <-->|"FunctionCallRequest / FunctionCallResponse"| App["App backend (tool proxy)"]
One persistent connection carries everything: incoming audio, outgoing audio, tool calls, state events. The STT-LLM-TTS sequencing happens inside Deepgram.
The session startup handshake
To keep the API key out of frontend source code, the session flow is:
sequenceDiagram
participant B as Browser
participant A as App (Laravel)
participant D as Deepgram
B->>A: POST /api/voice-agent/start
A->>B: { websocket_url, api_key, settings }
B->>D: WebSocket connect (api_key as subprotocol)
D->>B: Welcome
B->>D: Settings (model, voice, tools, instructions)
D->>B: SettingsApplied
B->>D: PCM audio stream begins
The backend resolves the right model, voice, agent, tools, and prompt before returning the settings payload. The browser just forwards that blob to Deepgram.
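In browser code the whole handshake fits in a screenful. A minimal sketch, assuming the response shape above and omitting reconnection and error handling; playPcmChunk and startMicCapture are placeholders for the AudioWorklet plumbing described later:

```ts
// Minimal handshake sketch; assumes the response shape above and omits
// reconnection/error handling. The two declared functions are placeholders
// for the AudioWorklet plumbing described later in this post.
declare function playPcmChunk(buf: ArrayBuffer): void; // playback worklet
declare function startMicCapture(ws: WebSocket): void; // capture worklet

async function startVoiceSession(): Promise<void> {
  const res = await fetch("/api/voice-agent/start", { method: "POST" });
  const { websocket_url, api_key, settings } = await res.json();

  // The key travels as a WebSocket subprotocol, never in page source
  const ws = new WebSocket(websocket_url, ["token", api_key]);
  ws.binaryType = "arraybuffer";

  ws.onmessage = (event) => {
    if (typeof event.data !== "string") {
      playPcmChunk(event.data); // binary frames are PCM16 audio from the agent
      return;
    }
    const msg = JSON.parse(event.data);
    if (msg.type === "Welcome") ws.send(JSON.stringify(settings)); // forward the blob
    if (msg.type === "SettingsApplied") startMicCapture(ws);       // begin streaming mic PCM
  };
}
```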
The tool call proxy
The voice agent can call tools mid-conversation. Deepgram sends a FunctionCallRequest event to the browser with the function name and arguments. The browser cannot execute tools directly, so it proxies through the app:
sequenceDiagram
participant D as Deepgram
participant B as Browser
participant A as App
D->>B: FunctionCallRequest { name, id, arguments }
B->>A: POST /api/voice-agent/tool-call
A->>A: Execute tool via VoiceAgentToolBridge
A->>B: { output, ui_output }
B->>D: FunctionCallResponse { id, content: output }
The critical design decision here: VoiceAgentToolBridge reuses MainAgent tool resolution. Voice mode and text mode run exactly the same tools, with the same user and agent allowlist rules applied. There is no separate voice tool registry to maintain.
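The browser half of that proxy is a thin relay. A sketch, with field names following the sequence diagram above (the diagram's output field carries what the next section calls voice_output):

```ts
// Browser-side relay sketch; field names follow the sequence diagram above,
// and error handling is omitted for brevity.
async function handleFunctionCall(ws: WebSocket, msg: any): Promise<void> {
  // Forward the call to the app, which executes the real tool
  const res = await fetch("/api/voice-agent/tool-call", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ id: msg.id, name: msg.name, arguments: msg.arguments }),
  });
  const { output } = await res.json(); // ui_output stays on the app/chat side

  // Only the voice-safe output goes back to Deepgram to be spoken
  ws.send(JSON.stringify({ type: "FunctionCallResponse", id: msg.id, content: output }));
}
```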
Splitting voice output from UI output
Some tools produce results that should not be spoken aloud. An image URL is useless when read out. A long signed S3 URL is embarrassing.
The tool call service returns two separate outputs for every call: voice_output (what Deepgram says to the user) and ui_output (what gets saved to the chat and rendered as a media card).
For image generation, that looks like this:
voice_output: "Image generated successfully. I saved it to the chat preview."ui_output:[IMAGE:https://...]marker that the chat renderer picks up
URLs never pass through the spoken response. The user hears a clean confirmation, and the image appears in chat alongside the voice transcript.
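As a shape, the split is just two fields per result. A hypothetical TypeScript sketch (the real service is PHP; presumably voice_output is what comes back to the browser as the output field in the proxy response above):

```ts
// Hypothetical sketch of the two-output shape; the real service is PHP.
interface ToolCallResult {
  voice_output: string; // spoken by Deepgram: short, URL-free confirmation
  ui_output: string;    // saved to the chat and rendered as a media card
}

function imageGenerationResult(url: string): ToolCallResult {
  return {
    voice_output: "Image generated successfully. I saved it to the chat preview.",
    ui_output: `[IMAGE:${url}]`, // marker the chat renderer picks up
  };
}
```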
Message persistence
Voice conversations are transcribed by Deepgram in real time via ConversationText events. But those transcripts live in the browser, not in the database. To make voice sessions show up in chat history, the browser batches them and flushes to /api/voice-agent/messages after each completed turn, and again when the session ends.
flowchart TD
DG["Deepgram ConversationText event"] --> Acc["Accumulate in memory"]
AudioDone["AgentAudioDone event"] --> Flush["POST /api/voice-agent/messages"]
Stop["Session stop"] --> FlushAll["Flush remaining messages (up to 8 retries)"]
Flush --> DB["Saved to conversation"]
FlushAll --> DB
This means if you switch from voice mode to text mode mid-conversation, the history is already there.
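The batching itself is a few lines. A sketch, assuming the event names above; the stop-time retry loop is simplified here, where the real code retries up to 8 times:

```ts
// Transcript batching sketch; event names come from the flow above.
// The real stop-time flush retries up to 8 times; this version does not.
const pending: { role: string; content: string }[] = [];

function onConversationText(msg: { role: string; content: string }): void {
  pending.push({ role: msg.role, content: msg.content }); // accumulate in memory
}

async function flushMessages(): Promise<void> {
  if (pending.length === 0) return;
  const batch = pending.splice(0); // drain the in-memory buffer
  await fetch("/api/voice-agent/messages", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ messages: batch }),
  });
}

// flushMessages() runs on AgentAudioDone (turn complete) and on session stop
```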
Smart greeting: first turn vs resume
When you open voice mode on an existing conversation, "Hello, how can I help you?" is annoying. The session service checks whether the conversation already has user-visible messages, and uses a different (shorter, contextual) resume greeting if so. Both greetings are configurable per-agent and per-user preference.
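In sketch form the check is a single branch (the real session service is PHP, and both strings come from agent and user preferences):

```ts
// Sketch only: the real session service is PHP and the strings are configurable.
function pickGreeting(
  hasVisibleMessages: boolean,
  prefs: { first: string; resume: string },
): string {
  return hasVisibleMessages ? prefs.resume : prefs.first;
}
```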
Tool schema optimisation
Voice agents run on constrained TPM budgets. Sending the full tool payload from text chat would burn tokens on every turn.
The VoiceAgentConfig::optimiseToolSchemas method trims the schema before it gets included in the settings:
- Descriptions capped at 140 characters
- Schema noise stripped (examples, title, default, description fields removed from properties)
- Tool count capped at a configurable maximum (default: 12)
- Priority order preserved so high-importance tools are always included
The LLM still sees the tools it needs. It just does not waste tokens on schema documentation it would ignore anyway.
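A TypeScript sketch of the same trimming, using the constants from the list above. The real implementation is the PHP method VoiceAgentConfig::optimiseToolSchemas; this version assumes tools arrive already sorted by priority:

```ts
// Sketch of the schema trimming; the real implementation is PHP
// (VoiceAgentConfig::optimiseToolSchemas). Assumes tools arrive priority-sorted.
const MAX_DESCRIPTION = 140;
const MAX_TOOLS = 12;
const NOISE_KEYS = new Set(["examples", "title", "default", "description"]);

function optimiseToolSchemas(tools: any[]): any[] {
  return tools.slice(0, MAX_TOOLS).map((tool) => ({
    ...tool,
    description: tool.description?.slice(0, MAX_DESCRIPTION),
    parameters: {
      ...tool.parameters,
      properties: Object.fromEntries(
        Object.entries(tool.parameters?.properties ?? {}).map(([name, prop]) => [
          name,
          // Strip documentation-only keys the LLM would ignore anyway
          Object.fromEntries(
            Object.entries(prop as object).filter(([key]) => !NOISE_KEYS.has(key)),
          ),
        ]),
      ),
    },
  }));
}
```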
AudioWorklet for PCM audio
The browser side uses the Web Audio API's AudioWorklet for low-latency audio capture and playback. The worklet runs in a dedicated thread, emitting raw 16kHz PCM16 frames for the microphone and accepting 24kHz PCM16 frames from Deepgram for playback.
Signal level is computed in the worklet and sent back to the main thread to drive the orb animation. The orb reflects what is actually happening:
- Listening: tracks microphone input level
- Speaking: tracks playback output level
- Thinking: holds a gentle idle pulse
Barge-in is handled by the UserStartedSpeaking event from Deepgram (native, not client-side VAD). When it fires, the browser clears the playback buffer immediately. That is the "it just stops" moment that makes conversation feel natural.
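The capture worklet itself is short. A sketch, assuming the AudioContext runs at 16kHz so no resampling is needed here; each frame yields a PCM16 buffer for the WebSocket plus an RMS level for the orb:

```ts
// Capture worklet sketch: Float32 frames in, PCM16 + signal level out.
// Assumes a 16kHz AudioContext so no resampling is needed in this sketch.
class CaptureProcessor extends AudioWorkletProcessor {
  process(inputs: Float32Array[][]): boolean {
    const channel = inputs[0]?.[0];
    if (!channel) return true;

    const pcm = new Int16Array(channel.length);
    let sumSquares = 0;
    for (let i = 0; i < channel.length; i++) {
      const s = Math.max(-1, Math.min(1, channel[i])); // clamp to [-1, 1]
      pcm[i] = s < 0 ? s * 0x8000 : s * 0x7fff;        // float -> int16
      sumSquares += s * s;
    }

    // Transfer the PCM buffer to the main thread; the level drives the orb
    this.port.postMessage(
      { pcm: pcm.buffer, level: Math.sqrt(sumSquares / channel.length) },
      [pcm.buffer],
    );
    return true; // keep the processor alive
  }
}
registerProcessor("capture-processor", CaptureProcessor);
```

The playback side is the mirror image, and barge-in is just queue hygiene: when UserStartedSpeaking arrives, the main thread tells the playback worklet to drop its buffered frames.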
What changed for me
The old voice mode had distinct phases you could feel:
- Speak
- Wait for STT
- Wait for LLM to start
- Wait for first TTS chunk
- Listen
The new voice mode has no perceptible phase transitions. Speech goes in, speech comes back. Tool calls happen silently in the background. Interrupting works without any extra setup.
The most significant thing is not any single latency number. It is that I have stopped thinking about the infrastructure when I am talking to Iris.
Decision table
| Problem | Approach | Tradeoff |
|---|---|---|
| Multi-stage pipeline latency | Single WebSocket voice agent | Less control over individual pipeline stages |
| API key security | Backend session start, never expose key in frontend | Extra round trip on session open |
| Tool access in voice mode | Reuse MainAgent tool resolution via bridge | Voice tools must be compatible with existing tool interfaces |
| URLs in spoken output | Split voice_output / ui_output | Slightly more complex tool call service |
| Transcript persistence | Browser batches and flushes ConversationText events | Transcripts lost if browser closes before flush |
| Token budget | Schema optimisation on session start | Truncated descriptions, tool count cap |
Bottom line
The voice pipeline is now a single abstraction. One WebSocket, one settings payload, one connection to manage.
The browser does less. The backend does one clear job (resolve session config and proxy tool calls). Deepgram handles everything in between.
That simplicity is what makes the conversation feel different.
Glossary
- PCM16 - Raw audio samples, 16-bit signed integers, no compression, no container format. Think of it as the audio equivalent of a bitmap image: unencoded, just the numbers. The microphone captures at 16kHz (16,000 samples per second), Deepgram sends audio back at 24kHz. It is fast to encode and decode because there is nothing to encode or decode.
- WebSocket - A persistent two-way connection between the browser and a server. Unlike normal HTTP where you make a request and get a response, a WebSocket stays open and both sides can send data at any time. That is what makes real-time audio streaming possible.
- STT (Speech-to-Text) - Converts spoken audio into a text transcript. The "listen" stage.
- TTS (Text-to-Speech) - Converts a text string into spoken audio. The "speak" stage.
- LLM (Large Language Model) - The AI model doing the actual thinking. In this setup, the "think" stage.
- VAD (Voice Activity Detection) - Software that detects whether a person is currently speaking or not. Used to know when to stop listening and start processing.
- Barge-in - The ability to interrupt the AI while it is speaking and have it stop and listen to you instead. Without it, you have to wait for the AI to finish before you can say anything.
- AudioWorklet - A browser API that lets you run audio processing code in a dedicated background thread, separate from the main UI thread. Lower latency than older approaches because the audio processing never has to wait for the page to finish rendering.
- TPM (Tokens Per Minute) - A rate limit imposed by LLM providers. Every word, punctuation mark, and piece of context sent to the model costs tokens. Voice agents are particularly sensitive to this because tool schemas add overhead on every single turn.
What's next
The provider abstraction (VoiceAgentProvider) was built with exactly this in mind. Next up is implementing the same flow with Inworld Realtime, which is a direct competitor: WebSocket-based, speech-to-speech, same single-connection model. It follows the OpenAI Realtime protocol and has native interruption handling built in. It is currently in research preview, so pricing for the Realtime API is not yet public, but their TTS layer alone runs $5-10 per million characters - significantly cheaper than Deepgram's TTS in isolation. Whether that holds up at the full voice agent level remains to be seen.
Once both are running, I will do a proper side-by-side comparison: real-world latency, barge-in feel, tool call reliability, and cost. Should be interesting.
Deepgram vs Inworld: cost breakdown
Deepgram bundles everything into a flat per-minute rate. Inworld works differently: you pay for each component separately, and LLM cost is passed through at the provider's own rates with no markup.
| | Deepgram | Inworld |
|---|---|---|
| STT | Bundled | $0.15/hour ($0.0025/min) |
| LLM | Bundled | Pass-through at provider rate (no markup) |
| TTS | Bundled | $0.005/min (Mini) or $0.01/min (Max) |
| All-in Standard | $0.08/min | n/a (you assemble it) |
| BYO LLM + TTS | $0.05/min | n/a (not applicable, always BYO) |
| Estimated total (BYO LLM, TTS Mini) | $0.05/min | ~$0.0075/min |
| Free credit | $200 | $1 free credit for one year (STT + TTS) |
| Voice cloning | Not included | Free (zero-shot, no extra charge) |
| Status | Generally available | Research preview |
The estimated $0.0075/min for Inworld assumes Inworld STT ($0.0025/min) plus TTS Mini ($0.005/min), with the LLM cost separate on top. Since I already use Chutes and Groq and pay for those directly, the LLM line is effectively the same either way. The voice infrastructure cost alone would drop from $0.05/min to roughly $0.0075/min - about 6-7x cheaper.
Why I might eventually switch to Inworld
A few reasons this is worth watching:
- Protocol alignment. Inworld Realtime follows the OpenAI Realtime protocol. Deepgram uses standard WebSocket transport (RFC 6455) but layers its own event schema on top - Welcome, SettingsApplied, FunctionCallRequest, and so on - which is Deepgram-specific. That alignment makes Inworld a more portable choice if I ever want to swap providers again, since more tooling speaks the OpenAI Realtime protocol natively.
- TTS quality and voice cloning. Inworld's voice models are consistently rated highly in independent benchmarks, particularly for expressiveness and naturalness. They also include zero-shot voice cloning at no extra charge - meaning Iris could speak in a custom cloned voice without paying anything on top of the standard TTS rate. Deepgram has no equivalent.
- Cost trajectory. Research preview products are almost always cheaper at launch to attract adoption. If Inworld prices the full Realtime API aggressively, the math could shift.
- The abstraction is already there. The VoiceAgentProvider interface means switching is a matter of writing one new provider class and updating config. No rearchitecting required.
That said, Deepgram is production-ready today and the $200 credit means I am not paying anything for a while. Inworld gives $1 in free credit for a full year on signup (covering STT and TTS), which at $0.0075/min goes further than it sounds. So the plan is: ship with Deepgram now, implement Inworld when the Realtime API graduates from preview, run them in parallel, and let the numbers decide.