Iris had image analysis. Iris had video generation. What it did not have was video understanding.
I kept sending clips and asking "what is happening here?" and the system had no proper pipeline for that request. So this feature was not a "nice-to-have" polish pass. It was a missing capability.
The interesting part wasn't coding the happy path. The interesting part was untangling the contradictions and building something reliable enough that I trust it in production.
The first contradiction: docs vs reality
At first glance, the story looked straightforward. Chutes model metadata showed input_modalities including video for models like Kimi. Great, wire video_url, done.
Then the upstream model docs said video support was limited to specific serving paths. That creates a classic integration trap: one source says yes, another says maybe not.
So I stopped guessing and tested the real endpoint behavior directly.
Result: video requests were accepted and processed in the deployed path I actually use. That settled the decision: ship it.
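The probe itself was simple. A minimal sketch of the request shape, assuming an OpenAI-compatible chat completions API; the model id and the `video_url` content-part shape are assumptions here, not confirmed field names:

```python
import json

def build_video_probe_payload(model: str, video_url: str, prompt: str) -> dict:
    """Assemble an OpenAI-style chat payload with a video content part.

    The "video_url" part shape is an assumption modeled on how image parts
    work in OpenAI-compatible APIs; verify against the actual endpoint.
    """
    return {
        "model": model,
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "video_url", "video_url": {"url": video_url}},
                    {"type": "text", "text": prompt},
                ],
            }
        ],
    }

# The probe is just POSTing this to the deployed chat completions endpoint
# and checking whether the request is accepted and processed end to end.
payload = build_video_probe_payload(
    "kimi",  # placeholder model id
    "https://example.com/clip.mp4",
    "What is happening in this video?",
)
print(json.dumps(payload, indent=2))
```

The point of testing the deployed path rather than reading more docs: the serving configuration, not the model card, decides what actually works.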
Decision 1: don't hardcode one model
The obvious quick fix would have been "just force Kimi." I didn't want that.
Model catalogs move too fast, and model quality shifts by task. So the routing path was built to be config-driven:
- provider defaults and allowlists
- per-provider video analysis model
- capability checks based on model metadata
Chutes is implemented now, but the structure is there for additional providers.
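A sketch of what config-driven routing looks like; the field names and catalog shape are illustrative assumptions, not the exact schema Iris uses:

```python
from dataclasses import dataclass, field

@dataclass
class ProviderConfig:
    """Illustrative per-provider routing config."""
    video_analysis_model: str
    allowed_models: set = field(default_factory=set)

def resolve_video_model(cfg: ProviderConfig, catalog: dict) -> str:
    """Accept the configured analysis model only if the allowlist and the
    model's advertised input_modalities both agree it can take video."""
    model = cfg.video_analysis_model
    if cfg.allowed_models and model not in cfg.allowed_models:
        raise ValueError(f"{model} not in provider allowlist")
    modalities = catalog.get(model, {}).get("input_modalities", [])
    if "video" not in modalities:
        raise ValueError(f"{model} does not advertise video input")
    return model

cfg = ProviderConfig(video_analysis_model="kimi", allowed_models={"kimi"})
catalog = {"kimi": {"input_modalities": ["text", "image", "video"]}}
print(resolve_video_model(cfg, catalog))  # -> kimi
```

Swapping models becomes a config change plus a capability check, not a code change.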
Decision 2: no local disk for analysis inputs
The requirement was explicit: do not store analysis videos on local disk.
So analysis runs on URL references (or in-memory payload conversion for direct API calls), not temp file persistence in the chat path. For queued jobs tied to uploaded attachments, the source is resolved from existing object storage URLs.
This keeps operational behavior consistent with how attachments already flow through the system.
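The two input paths can be sketched like this; the content-part shape is an assumption carried over from the provider's multimodal message format:

```python
import base64

def input_part_from_url(url: str) -> dict:
    """Chat path: reference the video by URL, no local persistence."""
    return {"type": "video_url", "video_url": {"url": url}}

def input_part_from_bytes(data: bytes, mime: str = "video/mp4") -> dict:
    """Direct-call path: convert in memory to a data URL instead of
    writing a temp file to disk."""
    encoded = base64.b64encode(data).decode("ascii")
    return {"type": "video_url", "video_url": {"url": f"data:{mime};base64,{encoded}"}}
```

Either way, nothing hits the local filesystem between receiving the video and calling the model.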
The real bottleneck: synchronous timeouts
Once video analysis worked, the next issue appeared immediately: request duration variance.
Small clips came back quickly. Larger clips could run long enough to hit timeout ceilings. That's where most "it works on my machine" features die.
So I introduced a hybrid policy instead of pretending one path fits all:
- small videos: synchronous inline analysis
- larger videos: queue immediately, return job_id, complete in background
Now users get deterministic behavior instead of waiting on a fragile long request.
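The decision itself is deliberately boring. A sketch, with an illustrative threshold (the real cutoff is a deployment config value):

```python
# Illustrative threshold, not the production value.
SYNC_SIZE_LIMIT_BYTES = 25 * 1024 * 1024

def choose_analysis_path(size_bytes) -> str:
    """Small clips analyze inline; large or unknown-size clips are queued
    so the caller gets a job_id back immediately instead of a long-running
    request that may hit a timeout ceiling."""
    if size_bytes is not None and size_bytes <= SYNC_SIZE_LIMIT_BYTES:
        return "sync"
    return "queue"
```

Unknown size defaults to the queue: the conservative path degrades to slightly slower, never to a timeout.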
Decision 3: video analysis needed its own job type
I considered piggybacking on existing media generation job records. Rejected that.
Generation and analysis have different payloads, outputs, and lifecycle semantics. Forcing them into one table would make both harder to reason about.
So video analysis got a dedicated queue model + job worker + presenter shape. Status tracking remains accessible via the existing media jobs endpoint pattern, so UX remains familiar.
Telegram forced a separate queue rule
On web uploads, I can usually rely on stored attachment metadata (including size) to decide queue-vs-sync.
Telegram is different. Attachments enter as remote file URLs, and size metadata is not always available where the decision happens.
That made threshold-only logic inconsistent for Telegram videos.
So Telegram got a targeted policy: auto-queue Telegram-hosted video analysis by default. Predictable behavior beats clever heuristics here.
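The channel rule can be sketched as a guard in front of the size check (names and threshold are illustrative):

```python
def should_queue_video_analysis(channel: str, size_bytes=None,
                                limit: int = 25 * 1024 * 1024) -> bool:
    """Telegram-hosted videos queue unconditionally, because size metadata
    may be missing where this decision runs; other channels use the size
    threshold and treat unknown size conservatively."""
    if channel == "telegram":
        return True
    if size_bytes is None:
        return True
    return size_bytes > limit
```

One explicit branch per channel quirk beats a heuristic that silently behaves differently depending on which metadata happened to arrive.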
Thinking mode had to be explicit
Kimi-style reasoning models can burn extra tokens and add latency in thinking mode. Sometimes that's worth it for video interpretation. Sometimes it's overkill.
I made thinking control configurable for video analysis instead of silently forcing one behavior forever. That's a cost/latency knob, not a hardcoded product belief.
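Mechanically, it's one flag attached to the request. The exact knob differs by serving stack; `chat_template_kwargs` / `enable_thinking` below follow a common vLLM-style convention and are assumptions, not Iris's real field names:

```python
def apply_thinking_control(payload: dict, enable_thinking: bool) -> dict:
    """Attach the configured thinking toggle to an outgoing request payload
    without mutating the original dict. The field names are assumptions
    modeled on vLLM-style serving conventions."""
    out = dict(payload)
    out["chat_template_kwargs"] = {"enable_thinking": enable_thinking}
    return out
```

Reading the toggle from config keeps the cost/latency trade-off adjustable per deployment rather than baked into code.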
What this sprint actually changed
Before:
- no real video analysis pipeline
- no reliable behavior for longer clips
- Telegram path had inconsistent queue decisions for video analysis
After:
- video analysis is a shipped capability
- large video analysis is queue-safe with explicit job tracking
- completion results are delivered back into the same conversation
- Telegram video analysis is predictably queued
- provider/model selection is config-driven instead of locked to one model
I used Codex as an implementation copilot through this: I drove the product and architecture decisions, and Codex helped accelerate the code changes, refactors, and verification loops across backend and UI.
This is the kind of feature where "the model can do it" is maybe 30% of the work.
The rest is systems engineering: capability detection, queue strategy, channel-specific behavior, failure modes, and operational defaults that don't punish users for sending real-world inputs.

It's also a strong stepping stone toward the Frigate NVR integration on the roadmap. Frigate can handle detection and event extraction, Iris can handle context and escalation logic, and if full-clip video analysis is unstable in specific cases, I can fall back to key-frame image analysis to preserve reliability while still producing useful event summaries.