Iris has a voice mode. You tap the orb, speak, and it responds with audio. The problem was it felt like talking to someone on a really bad phone connection - you'd finish speaking, wait half a minute, then suddenly hear the full response.
Here's what was actually happening under the hood and how I fixed it.
The waterfall problem
The original flow was painfully sequential:
- User speaks, browser transcribes it
- Text goes to the chat endpoint, which streams the response back
- Frontend waits for the entire stream to finish
- Full response text gets sent to the TTS endpoint in one big request
- TTS synthesises the entire thing and returns base64 audio
- Audio plays
- Only then does it start listening again
For a short "hello" this was tolerable - maybe 3-4 seconds total. But ask anything that requires actual thinking and you're staring at the orb for 20-30 seconds before hearing a single word. The model I'm using (Kimi K2.5 TEE on Chutes) is a reasoning model. It thinks before it speaks. That thinking time, plus the full synthesis time, plus Nginx buffering the stream - it all stacked up.
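Condensed into code, the old flow looked roughly like this. The endpoint paths and helper functions here are placeholders for illustration, not the actual Iris frontend:

```ts
// Rough sketch of the old sequential flow - endpoints and helpers are placeholders.
declare function readWholeStream(body: ReadableStream<Uint8Array>): Promise<string>;
declare function playBase64Audio(b64: string): Promise<void>;
declare function startListening(): void;

async function handleUtterance(transcript: string): Promise<void> {
  // 1. Send the transcript and wait for the ENTIRE chat stream to finish.
  const chatRes = await fetch("/api/chat", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ message: transcript }),
  });
  const fullText = await readWholeStream(chatRes.body!);

  // 2. Synthesise the whole response in a single TTS request.
  const ttsRes = await fetch("/api/tts", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ text: fullText }),
  });
  const { audio } = await ttsRes.json(); // base64 audio

  // 3. Only now does anything play, and only after that do we listen again.
  await playBase64Audio(audio);
  startListening();
}
```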
Three fixes, layered together
1. Stop Nginx from buffering the stream
The streaming response from the chat endpoint was missing the X-Accel-Buffering: no header. Without it, Nginx holds the streamed chunks in a buffer before forwarding them to the browser. I was literally getting the data from the AI provider, then Nginx was sitting on it. Added that header plus Cache-Control: no-cache, no-transform to all streaming responses. Free improvement.
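For illustration, here's a minimal sketch of the header change, assuming an Express-style streaming route; the real handler and the provider-streaming helper will look different:

```ts
// Minimal sketch, assuming an Express-style route behind Nginx.
import express from "express";

// Placeholder for whatever yields text deltas from the AI provider.
declare function streamFromProvider(body: unknown): AsyncIterable<string>;

const app = express();
app.use(express.json());

app.post("/api/chat", async (req, res) => {
  // Tell Nginx not to buffer this response, and keep proxies/caches from
  // holding or transforming the streamed chunks.
  res.setHeader("X-Accel-Buffering", "no");
  res.setHeader("Cache-Control", "no-cache, no-transform");

  for await (const delta of streamFromProvider(req.body)) {
    res.write(delta); // each chunk is forwarded to the browser as it arrives
  }
  res.end();
});
```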
2. Chunked TTS instead of full-response TTS
This was the big one. Instead of waiting for the complete response and synthesising it all at once, I built a chunked TTS system that works like this:
- As text streams in from the chat endpoint, it accumulates in a buffer
- When a sentence boundary is detected (a period, exclamation mark, or question mark followed by whitespace) and the accumulated chunk is at least 30 characters long, that chunk gets sent to the TTS endpoint immediately
- Audio chunks get queued and played sequentially
- When the stream finishes, any remaining buffered text gets flushed as a final chunk
So now the first sentence starts playing while the rest of the response is still streaming in. For a response like "Everything appears to be functioning normally on my end. How can I assist you today?" - the first sentence starts playing almost immediately after it finishes streaming, while the second sentence synthesises in the background.
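Here's a rough sketch of the chunking logic. The 30-character minimum and the boundary characters match what's described above; the regex and helper names are illustrative rather than the actual implementation:

```ts
// Sketch of the sentence-boundary chunker - constants match the numbers above,
// but the regex and helpers are illustrative, not the real Iris code.
const MIN_CHUNK_CHARS = 30;
const sentenceEnd = /[.!?]\s/; // ., ! or ? followed by whitespace

declare function synthesize(text: string): Promise<Blob>;  // calls the TTS endpoint
declare function enqueueAudio(audio: Promise<Blob>): void; // sequential playback queue

let buffer = "";

// Called for every text delta streamed from the chat endpoint.
function onDelta(delta: string): void {
  buffer += delta;
  let searchFrom = 0;
  for (;;) {
    const match = sentenceEnd.exec(buffer.slice(searchFrom));
    if (!match) break;
    const end = searchFrom + match.index + 1; // include the punctuation mark
    if (end < MIN_CHUNK_CHARS) {
      searchFrom = end; // too short on its own - look for a later boundary
      continue;
    }
    enqueueAudio(synthesize(buffer.slice(0, end).trim())); // synthesis starts now
    buffer = buffer.slice(end).trimStart();
    searchFrom = 0;
  }
}

// Called when the chat stream finishes: flush whatever is left over.
function onStreamEnd(): void {
  if (buffer.trim()) enqueueAudio(synthesize(buffer.trim()));
  buffer = "";
}
```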
There's also handling for edge cases: <think> blocks from reasoning models get stripped (you don't want to hear the model's internal reasoning read aloud), system markers get removed, and very long text without sentence boundaries gets force-split at 300 characters.
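A sketch of that clean-up pass: the <think> stripping and the 300-character force-split follow the description above, while the system-marker removal is left out here since its format is specific to Iris.

```ts
// Sketch of the clean-up applied before chunking.
const MAX_CHUNK_CHARS = 300;

// Remove reasoning-model thinking blocks so they are never spoken aloud.
function stripThinking(text: string): string {
  return text.replace(/<think>[\s\S]*?<\/think>/g, "").trim();
}

// Force-split long text with no sentence boundaries into <=300-char pieces,
// breaking at a space where possible rather than mid-word.
function forceSplit(text: string): string[] {
  const pieces: string[] = [];
  let rest = text;
  while (rest.length > MAX_CHUNK_CHARS) {
    let cut = rest.lastIndexOf(" ", MAX_CHUNK_CHARS);
    if (cut <= 0) cut = MAX_CHUNK_CHARS;
    pieces.push(rest.slice(0, cut).trim());
    rest = rest.slice(cut).trimStart();
  }
  if (rest) pieces.push(rest);
  return pieces;
}
```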
3. Prefetch the next audio chunk
After getting chunked TTS working, I noticed a pause between sentences. A look at the network tab made the cause obvious: the drain loop was purely sequential. Play chunk 1, then synthesise chunk 2, then play chunk 2. The synthesis gap between chunks was noticeable.
The fix: while chunk N is playing, start synthesising chunk N+1 in parallel. By the time the audio finishes, the next chunk's audio is already ready. The gap between sentences is now effectively zero.
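In sketch form, with the queue simplified to a fixed list of chunks and placeholder synthesize/playAudio helpers (in the real flow the queue keeps filling while the stream is still arriving):

```ts
// Sketch of the prefetching drain loop.
declare function synthesize(text: string): Promise<Blob>;
declare function playAudio(audio: Blob): Promise<void>; // resolves when playback ends

async function drain(chunks: string[]): Promise<void> {
  let pending: Promise<Blob> | null = chunks.length > 0 ? synthesize(chunks[0]) : null;

  for (let i = 0; i < chunks.length; i++) {
    const audio = await pending!;
    // While chunk i plays, chunk i+1 is already being synthesised.
    pending = i + 1 < chunks.length ? synthesize(chunks[i + 1]) : null;
    await playAudio(audio);
  }
}
```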
What I also did on the backend
While I was at it, I added timing instrumentation to the streaming pipeline. Every AI stream now logs ttft_ms (time to first token), duration_ms, delta_count, provider, and model. This is how I can actually measure whether changes help or whether I'm just guessing.
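A rough sketch of that instrumentation as a wrapper around the delta stream; the field names match the ones above, the rest is illustrative:

```ts
// Sketch of per-stream timing instrumentation.
interface StreamMetrics {
  ttft_ms: number;      // time to first token
  duration_ms: number;  // total stream duration
  delta_count: number;  // number of streamed deltas
  provider: string;
  model: string;
}

declare function logStreamMetrics(m: StreamMetrics): void; // placeholder logger

async function* instrument(
  deltas: AsyncIterable<string>,
  provider: string,
  model: string,
): AsyncGenerator<string> {
  const start = Date.now();
  let firstTokenAt: number | null = null;
  let count = 0;

  for await (const delta of deltas) {
    if (firstTokenAt === null) firstTokenAt = Date.now();
    count++;
    yield delta; // pass the delta through untouched
  }

  logStreamMetrics({
    ttft_ms: (firstTokenAt ?? Date.now()) - start,
    duration_ms: Date.now() - start,
    delta_count: count,
    provider,
    model,
  });
}
```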
I also trimmed the prompt payload. The system prompt was sending up to 50 memories, unlimited skills, and the full agent roster on every request. Capped memories to 30, skills to 10, agent roster to 15. Not a huge difference for latency - the real bottleneck is the AI provider's inference time - but there's no reason to send tokens we don't need.
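In sketch form, only the caps come from the real change; the context shape and field names are illustrative:

```ts
// Sketch of the prompt-payload caps.
interface PromptContext {
  memories: string[];
  skills: string[];
  agents: string[];
}

function trimContext(ctx: PromptContext): PromptContext {
  return {
    memories: ctx.memories.slice(0, 30),
    skills: ctx.skills.slice(0, 10),
    agents: ctx.agents.slice(0, 15),
  };
}
```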
What I didn't fix
The 20-30 second time-to-first-token on complex queries is still there. That's the AI provider's inference latency - Kimi K2.5 TEE running on Chutes' shared GPU infrastructure. No amount of backend optimisation changes how long it takes the model to start generating. The chunked TTS approach masks this better because at least you start hearing audio as soon as the first sentence arrives, but the initial wait is still there.
True barge-in (interrupting the AI by speaking over it, without clicking the orb) is also not implemented yet. That would need the microphone to stay active during playback with echo cancellation, which is a separate piece of work.
The result
Before: speak, wait 20-30 seconds in silence, hear the entire response at once.
After: speak, wait for the first sentence to stream and synthesise, hear it while the rest continues streaming. Follow-up sentences play back-to-back with no gaps thanks to prefetch. The total time to hear everything is roughly the same, but the perceived latency is significantly better because you're not sitting in silence for the full duration.