Yesterday's post was about latency. Today's is about reliability. The chunked TTS system made voice mode feel fast, but it would silently die after a few minutes. You'd finish speaking, wait, and nothing would happen. The orb would just sit there. Reload the page, try again, same thing two minutes later.
I spent most of today tracing every path the conversation loop can take and fixing every place it can silently break.
The conversation loop
Voice mode is a cycle: listen, transcribe, send to AI, speak the response, listen again. Four stages, each handing off to the next, then back to the start. The problem is that if any stage fails to trigger the next one, the loop just stops. No error, no crash, just silence.
Browser STT silently gives up
The Web Speech API has a built-in silence timeout. If you don't say anything for a few seconds, it fires onend and stops listening. My code handled onresult (when speech is detected) but completely ignored onend when no result was produced. The loop would die right there.
Fix: added onEnd and onError callbacks to the browser STT hook. When STT ends without producing a result, the caller gets notified and can restart listening. In voice mode, it auto-restarts after 300ms. Same for errors like no-speech: just restart.
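The restart decision boils down to a small pure function. This is an illustrative sketch, not the actual hook: nextSttAction and the SttOutcome shape are names I made up, and the real code wires the result to SpeechRecognition's onend/onerror handlers with the 300ms delay.

```typescript
type SttOutcome =
  | { type: 'end'; gotResult: boolean }
  | { type: 'error'; code: string };

// Errors that just mean "nothing was heard" — safe to restart from.
// (Only no-speech is mentioned in the post; extend as needed.)
const RECOVERABLE = new Set(['no-speech']);

function nextSttAction(outcome: SttOutcome, voiceModeActive: boolean): 'restart' | 'stop' {
  if (!voiceModeActive) return 'stop';
  if (outcome.type === 'end') {
    // onend without a result: the silence timeout fired; restart listening.
    // With a result, the normal pipeline (transcribe → send) takes over.
    return outcome.gotResult ? 'stop' : 'restart';
  }
  return RECOVERABLE.has(outcome.code) ? 'restart' : 'stop';
}
```

Keeping the decision separate from the scheduling makes the "loop died silently" class of bug testable without a browser.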
Server STT (Kokoro) never worked in voice mode
This one was embarrassing. When you select Kokoro as the STT source, it uses the MediaRecorder API to capture audio, then sends the blob to the backend for transcription. The problem: MediaRecorder has no silence detection. It records until you tell it to stop.
In voice mode, the only way to stop recording was tapping the orb. But tapping the orb called stopVoiceActivity, which sets a "cancel" flag before stopping the recorder. The onstop handler sees the flag and skips transcription. So tapping the orb to "finish speaking" actually threw away the recording.
Fix: created a separate stopListeningForTranscription function that stops the MediaRecorder without the cancel flag, so the onstop handler transcribes normally. Added a 10-second auto-stop timer so users don't have to tap at all. The orb click during listening now calls this graceful stop instead of the nuclear cancel.
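The two stop paths can be sketched like this. The function names follow the post; the SttRecorder wrapper and the narrowed RecorderLike interface are assumptions so the logic runs outside a browser.

```typescript
interface RecorderLike {
  stop(): void;
  onstop: (() => void) | null;
}

class SttRecorder {
  private cancelled = false;
  private timer: ReturnType<typeof setTimeout> | null = null;

  constructor(
    private recorder: RecorderLike,
    private transcribe: () => void,
    maxMs = 10_000, // auto-stop so the user never has to tap
  ) {
    recorder.onstop = () => {
      // The cancel flag is the only difference between the two paths.
      if (!this.cancelled) this.transcribe();
    };
    this.timer = setTimeout(() => this.stopListeningForTranscription(), maxMs);
  }

  // Graceful stop: the recording still gets transcribed.
  stopListeningForTranscription(): void {
    this.clearTimer();
    this.recorder.stop();
  }

  // Nuclear stop: the recording is thrown away.
  stopVoiceActivity(): void {
    this.clearTimer();
    this.cancelled = true;
    this.recorder.stop();
  }

  private clearTimer(): void {
    if (this.timer) { clearTimeout(this.timer); this.timer = null; }
  }
}
```

The original bug was exactly that the orb tap took the nuclear path when the user meant the graceful one.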
The TTS drain promise hung forever
When you cancel TTS mid-playback, audio.pause() is called. But the drain loop was waiting on a promise that only resolved on onended or onerror. Pausing an audio element fires neither. The promise just hung, and the drain loop never continued, so onComplete never fired, so listening never restarted.
Fix: added onpause to the promise resolution. Also re-check the cancelled flag after the promise resolves, in case cancel happened during playback.
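A minimal sketch of the fixed wait, assuming an HTMLAudioElement-like object. waitForPlaybackEnd and settleOnce are illustrative names, not the actual code.

```typescript
interface AudioLike {
  onended: (() => void) | null;
  onerror: (() => void) | null;
  onpause: (() => void) | null;
}

// Guard so the promise settles exactly once, whichever event fires first.
function settleOnce(fn: () => void): () => void {
  let done = false;
  return () => { if (!done) { done = true; fn(); } };
}

function waitForPlaybackEnd(audio: AudioLike): Promise<void> {
  return new Promise<void>((resolve) => {
    const settle = settleOnce(resolve);
    audio.onended = settle;
    audio.onerror = settle;
    audio.onpause = settle; // pause() fires neither ended nor error
  });
}
```

After the promise resolves, the drain loop still has to re-check the cancelled flag before playing the next chunk, since a resolution via onpause usually means a cancel happened mid-playback.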
Flush ignored cancellation
The flush() function (called when the stream ends to push remaining text to TTS) didn't check whether cancel had been called. So if you interrupted mid-stream, the stream would still finish, onFinish would call flush(), and flush() would push text into the now-cancelled TTS queue and call signalComplete(), restarting listening on top of whatever the interrupt was trying to do.
Fix: flush() now returns immediately if cancelled.
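The guard is one line, but the shape matters. A sketch of the relevant slice of the queue, with ChunkedTts and signalComplete taken from the post and everything else assumed:

```typescript
class ChunkedTts {
  private cancelled = false;
  private buffer = '';
  private queue: string[] = [];

  push(text: string): void {
    if (this.cancelled) return;
    this.buffer += text;
  }

  flush(): void {
    if (this.cancelled) return; // the fix: an interrupted stream stays dead
    if (this.buffer) { this.queue.push(this.buffer); this.buffer = ''; }
    this.signalComplete();
  }

  cancel(): void { this.cancelled = true; }

  pending(): string[] { return this.queue; }

  private signalComplete(): void { /* restarts listening in voice mode */ }
}
```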
Stream errors broke the loop permanently
If the streaming response hit a network error, onError would cancel TTS and show a toast. But it wouldn't restart listening. Voice mode just died.
Worse: when the user intentionally tapped the orb to interrupt, the stream's cancel() triggered onError, which showed a spurious "Network error" toast and then tried to auto-restart listening after a delay, undoing the interruption.
Fix: added an intentionalCancelRef. Both stopVoiceActivity and interruptVoice set it before calling cancel. The onError handler checks this flag and does nothing if the cancel was deliberate. Real network errors still trigger recovery.
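The pattern in miniature, assuming a React-style ref holder. The function bodies are a sketch; only the names intentionalCancelRef and stopVoiceActivity come from the post.

```typescript
const intentionalCancelRef = { current: false };

function stopVoiceActivity(cancelStream: () => void): void {
  intentionalCancelRef.current = true; // set BEFORE cancel, which triggers onError
  cancelStream();
}

function onStreamError(recover: () => void, showToast: (msg: string) => void): void {
  if (intentionalCancelRef.current) {
    intentionalCancelRef.current = false; // deliberate cancel: swallow the error
    return;
  }
  showToast('Network error');
  recover(); // real failure: restart listening after a delay
}
```

The flag has to be reset inside the handler, otherwise the next genuine network error would also be swallowed.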
Error responses got spoken aloud
When the AI stream ended with an error marker like [ERROR:rate_limited], the frontend would still call chunkedTts.flush(), which meant the error text got synthesised and read out loud. Not useful.
Fix: when the stream ends with an error marker and voice mode is active, cancel TTS instead of flushing, and restart listening directly.
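The end-of-stream branch as a decision function. The [ERROR:...] marker format is from the post; the function and action names are illustrative.

```typescript
const ERROR_MARKER = /\[ERROR:([a-z_]+)\]\s*$/;

function onStreamFinish(
  text: string,
  voiceModeActive: boolean,
): 'flush' | 'cancel-and-listen' | 'cancel' {
  if (!ERROR_MARKER.test(text)) return 'flush'; // normal end: speak remaining text
  // Error marker: never speak it; in voice mode, go straight back to listening.
  return voiceModeActive ? 'cancel-and-listen' : 'cancel';
}
```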
voiceState got stuck or flickered
The voiceState effect tried to derive the current state from a set of boolean flags (isRecording, isTranscribing, isFetching, isStreaming, browserStt.isListening). An earlier fix added an else clause to reset to idle when all flags were false. That caused flickering between "Ready" and "Listening" during every transition, because there's always a brief gap between one stage ending and the next starting where all flags are false.
Fix: removed the idle fallback entirely. The effect only promotes state forward (to listening, transcribing, thinking). Transitions back to idle are handled explicitly by the functions that are supposed to trigger them. No more flicker.
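The promote-only derivation, sketched as a pure function. The flag names come from the post; returning null means "keep the current state".

```typescript
type VoiceState = 'idle' | 'listening' | 'transcribing' | 'thinking' | 'speaking';

interface Flags {
  isRecording: boolean;
  isTranscribing: boolean;
  isFetching: boolean;
  isStreaming: boolean;
  isListening: boolean; // browserStt.isListening
}

function promoteVoiceState(f: Flags): VoiceState | null {
  if (f.isRecording || f.isListening) return 'listening';
  if (f.isTranscribing) return 'transcribing';
  if (f.isFetching || f.isStreaming) return 'thinking';
  return null; // no idle fallback: transitions to idle are explicit
}
```

The null case is the whole fix: the effect no longer interprets "all flags false" as idle, because that state is indistinguishable from a transition gap.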
Interrupt (tap to stop talking)
Previously there was no way to interrupt Iris while she was speaking. You had to reload the page. The orb technically called stopVoiceActivity during speaking, but the state sometimes didn't match due to the flickering issue, so the click handler would fire the wrong function.
Fix: explicit interruptVoice function that cancels TTS, stops the stream, and immediately restarts listening. The orb click handler now maps each voice state to the right action: idle starts listening, listening stops for transcription, speaking/thinking interrupts, transcribing cancels.
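The orb dispatch, sketched as a state-to-action map. The states and the behaviour per state are from the post; the action names are illustrative.

```typescript
type VoiceState = 'idle' | 'listening' | 'transcribing' | 'thinking' | 'speaking';
type OrbAction = 'startListening' | 'stopForTranscription' | 'interrupt' | 'cancel';

function orbClickAction(state: VoiceState): OrbAction {
  switch (state) {
    case 'idle':
      return 'startListening';
    case 'listening':
      return 'stopForTranscription'; // graceful stop: the recording is kept
    case 'speaking':
    case 'thinking':
      return 'interrupt'; // interruptVoice: cancel TTS/stream, restart listening
    case 'transcribing':
      return 'cancel';
  }
}
```

An exhaustive switch over the state union means the compiler complains if a new state is added without deciding what the orb should do there.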
Noise filtering
Browser STT was triggering on breathing, keyboard clicks, and background noise. Server STT was sending tiny blobs of silence for transcription.
Fix: browser STT now requires confidence >= 0.4 and minimum 2 characters. Server STT skips recordings under 600ms.
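The thresholds are the ones from the post; the function names are illustrative.

```typescript
const MIN_CONFIDENCE = 0.4;
const MIN_CHARS = 2;
const MIN_RECORDING_MS = 600;

// Browser STT: drop low-confidence or near-empty results
// (breathing, keyboard clicks, background noise).
function acceptBrowserResult(transcript: string, confidence: number): boolean {
  return confidence >= MIN_CONFIDENCE && transcript.trim().length >= MIN_CHARS;
}

// Server STT: don't send blobs of near-silence for transcription.
function shouldTranscribe(recordingMs: number): boolean {
  return recordingMs >= MIN_RECORDING_MS;
}
```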
Markdown getting read aloud
TTS was pronouncing **bold** as "asterisk asterisk bold asterisk asterisk" and ## Header as "hash hash Header". Not ideal.
Fix: added stripMarkdown() that removes headers, bold/italic markers, code blocks, links, list markers, blockquotes, and HTML tags before sending text to the TTS API.
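A sketch of what such a stripper looks like, covering the cases listed above. This is a naive regex version, not the actual implementation; a real one would use a markdown parser, and the underscore rule here can eat underscores inside identifiers like foo_bar.

```typescript
function stripMarkdown(text: string): string {
  return text
    .replace(/```[\s\S]*?```/g, '')           // fenced code blocks
    .replace(/`([^`]*)`/g, '$1')              // inline code
    .replace(/^#{1,6}\s+/gm, '')              // headers
    .replace(/\*\*([^*]+)\*\*/g, '$1')        // bold (before italic)
    .replace(/\*([^*]+)\*/g, '$1')            // italic
    .replace(/__([^_]+)__/g, '$1')
    .replace(/_([^_]+)_/g, '$1')              // naive: mangles snake_case
    .replace(/\[([^\]]+)\]\([^)]*\)/g, '$1')  // links: keep the link text
    .replace(/^\s*[-*+]\s+/gm, '')            // bullet list markers
    .replace(/^\s*\d+\.\s+/gm, '')            // numbered list markers
    .replace(/^>\s?/gm, '')                   // blockquotes
    .replace(/<[^>]+>/g, '')                  // HTML tags
    .replace(/[^\S\n]{2,}/g, ' ')             // collapse runs of spaces
    .trim();
}
```

Order matters: fenced code goes first so its contents never hit the other rules, and bold before italic so ** isn't misread as two italics.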
Voice catalogue
Expanded the Kokoro voice list from 4 hardcoded options to all 28 English voices (British and American) driven by a KokoroVoice PHP enum. The frontend fetches the voice list from a backend API endpoint instead of hardcoding it. Voices are grouped by accent and gender in the dropdown. Default changed to Emma (UK) because that's who I want to talk to.
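On the frontend side, the grouping is a plain bucket-by-key. A sketch assuming the endpoint returns entries shaped like this; the field names and example IDs are assumptions, not the actual API contract.

```typescript
interface KokoroVoice {
  id: string;                       // e.g. 'bf_emma' (assumed ID format)
  name: string;                     // e.g. 'Emma'
  accent: 'British' | 'American';
  gender: 'Female' | 'Male';
}

// Bucket voices into dropdown groups keyed by accent and gender.
function groupVoices(voices: KokoroVoice[]): Map<string, KokoroVoice[]> {
  const groups = new Map<string, KokoroVoice[]>();
  for (const v of voices) {
    const key = `${v.accent} · ${v.gender}`;
    const bucket = groups.get(key) ?? [];
    bucket.push(v);
    groups.set(key, bucket);
  }
  return groups;
}
```

A Map (rather than a plain object) keeps the groups in insertion order, so the dropdown order is whatever the backend sends.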
The result
Voice mode now survives continuous conversation without dying. You can interrupt mid-response. Silence timeouts auto-restart. Network errors recover. Server STT actually transcribes. The TTS doesn't read markdown formatting. And there are 28 voices to pick from instead of 4.