The web UI streams token by token. You see words forming as the model thinks. Telegram didn't do any of that. You'd send a message, watch "typing..." for 15 seconds, then get the entire response dumped into your chat at once. It felt like sending an email and waiting for a reply, not like having a conversation.
Why it was slow to begin with
The webhook handler was doing something called store-and-forward. It consumed every chunk from the AI stream into a PHP array, joined them into a single string, then fired one sendMessage call at the very end. The model might start producing output after 3 seconds, but you wouldn't see any of it until the last token arrived. All the streaming infrastructure on the backend was doing its job perfectly, and then the Telegram layer threw it all away by buffering everything.
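In sketch form, with placeholder names standing in for the real stream iterator and Bot API wrapper, the old handler amounted to this:

```php
// Hypothetical sketch of the old store-and-forward handler; $aiStream and
// $telegram stand in for the real stream iterator and Bot API wrapper.
$chunks = [];

foreach ($aiStream as $chunk) {
    // Every chunk lands in memory; nothing reaches the chat yet.
    $chunks[] = $chunk;
}

// Only after the last token arrives does Telegram see anything.
$telegram->sendMessage($chatId, implode('', $chunks));
```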
This is the same bottleneck pattern I fixed in the sub-agent pipeline back in January. Data flowing through a system but getting held at an intermediate point before delivery. The fix is also the same idea: stop buffering, start forwarding.
Telegram doesn't support real streaming
Unlike a browser, which can consume a chunked HTTP response or an SSE stream, Telegram's Bot API is request-response only. You send a message, you get a message back. There's no streaming endpoint, no way for text to arrive progressively as the model generates it.
But there's editMessageText. You can change the content of a message you've already sent. OpenClaw's approach (and now mine) is to exploit this: send an initial placeholder, then keep editing it as more content arrives. The user sees text growing inside the same message bubble. It's not true streaming, but the visual effect is close enough.
The edit-in-place loop
The flow now works like this. When the AI stream starts, the bot sends a ... placeholder message and captures the message_id that Telegram returns. As chunks arrive from the AI provider, they accumulate in a buffer. Every 1.5 seconds, the buffer gets flushed to an editMessageText call that replaces the message content with everything received so far, plus a trailing ... to indicate more is coming. When the stream ends, one final edit replaces the content cleanly with no indicator.
The 1.5-second interval isn't arbitrary. Telegram throttles edits to roughly 30 per minute per chat. Going faster risks hitting 429 Too Many Requests and losing edits. 1.5 seconds stays comfortably within the limit while still feeling responsive. There's also a 20-character minimum delta before an edit fires, so a burst of tiny tokens doesn't trigger unnecessary API calls that all look the same to the user.
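Putting those pieces together, a minimal sketch of the loop looks something like this. The `$telegram` wrapper and its method names are assumptions standing in for the real code:

```php
// Sketch of the edit-in-place loop; sendMessageWithId() and editMessageText()
// are placeholders for whatever the real Bot API wrapper exposes.
$messageId = $telegram->sendMessageWithId($chatId, '...');

$buffer    = '';
$lastSent  = '';
$lastFlush = microtime(true);

foreach ($aiStream as $chunk) {
    $buffer .= $chunk;

    $elapsed = microtime(true) - $lastFlush;
    $delta   = mb_strlen($buffer) - mb_strlen($lastSent);

    // Flush at most every 1.5 seconds, and only when at least 20 new
    // characters have arrived, to stay under the edit rate limit.
    if ($elapsed >= 1.5 && $delta >= 20) {
        $telegram->editMessageText($chatId, $messageId, $buffer . ' ...');
        $lastSent  = $buffer;
        $lastFlush = microtime(true);
    }
}

// Stream finished: one final edit with the full content and no indicator.
$telegram->editMessageText($chatId, $messageId, $buffer);
```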
Handling the 4096-character wall
Telegram caps message length at 4096 characters. The old code dealt with this by truncating at 3500 characters and throwing away everything after. That's fine for short answers but terrible for anything substantive.
Now if the response exceeds 4000 characters (leaving some headroom below the limit), the first 4000 characters stay in the original edited message and the overflow spills into new messages. Long-form answers get delivered in full instead of being silently chopped.
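Roughly, the split looks like this, again with placeholder names and multibyte string functions so the count is in characters rather than bytes:

```php
// Sketch of the overflow handling; 4000 leaves headroom under the 4096 cap.
$limit = 4000;

if (mb_strlen($buffer) <= $limit) {
    $telegram->editMessageText($chatId, $messageId, $buffer);
} else {
    // The first 4000 characters replace the streamed message in place.
    $telegram->editMessageText($chatId, $messageId, mb_substr($buffer, 0, $limit));

    // Everything beyond that goes out as follow-up messages.
    foreach (mb_str_split(mb_substr($buffer, $limit), $limit) as $part) {
        $telegram->sendMessage($chatId, $part);
    }
}
```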
When it can't stream
If the initial placeholder send fails - network hiccup, temporary rate limit, anything - the system falls back to the old behaviour. Collect all chunks, send once at the end. No data loss. The user gets their answer, just without the progressive reveal. Graceful degradation, not a crash.
The same fallback applies if sendMessageWithId returns a null message ID. The code doesn't assume the happy path. It checks, and it has a plan B.
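In sketch form, the guard looks something like this, with the same placeholder wrapper as above:

```php
// Sketch of the graceful-degradation path.
try {
    $messageId = $telegram->sendMessageWithId($chatId, '...');
} catch (\Throwable $e) {
    $messageId = null; // network hiccup, rate limit, anything: no streaming
}

if ($messageId === null) {
    // Plan B: the old behaviour. Collect every chunk, send once at the end.
    $chunks = [];
    foreach ($aiStream as $chunk) {
        $chunks[] = $chunk;
    }
    $telegram->sendMessage($chatId, implode('', $chunks));
    return;
}

// Otherwise carry on with the edit-in-place loop shown earlier.
```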
Queue streaming
There's a background job called ProcessConversationQueueEntry that handles messages queued behind an active response. Previously it collected all chunks and sent one final message, same as the webhook path used to. Now it calls the same streamToTelegram method that the live path uses. A queued message gets the same progressive delivery as a direct one.
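As a rough sketch, with the job's surrounding scaffolding assumed rather than copied from the real class, the change boils down to:

```php
// Sketch only: the real ProcessConversationQueueEntry has more context, but
// the core change is that handle() forwards the stream instead of buffering it.
public function handle(): void
{
    // Hypothetical names: $this->provider yields chunks, $this->streamer is
    // whatever object exposes the shared streamToTelegram() helper.
    $stream = $this->provider->stream($this->prompt);

    // Same progressive edit-in-place delivery as the live webhook path.
    $this->streamer->streamToTelegram($this->chatId, $stream);
}
```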
What it feels like now
Before: send a message, stare at "typing..." for 15 seconds, wall of text appears. After: send a message, see ..., watch words filling in every couple of seconds, final clean message. The AI hasn't gotten faster. The perceived latency just dropped because showing partial progress is always better than showing nothing. Same principle as the chunked TTS work on voice mode - you don't need to be faster if you can start delivering sooner.