May 19, 2026 · video-ai · live-streaming · vision-models · sports

What we learned building live AI from scratch

Custom RTMP, model benchmarks, and the moment the system did a control room's job.

Some weeks I write about systems. This week we shipped one.

At the BIG3 combine in Vegas, our five-person team ran live AI analytics for the biggest 3-on-3 league in the world. No on-site streaming studio. No turnkey vendor. We built the whole live stack ourselves. Custom RTMP server taking the camera feed, multi-model inference layer running against every frame, event pipeline shipping made shots, fouls, momentum shifts, and highlight cuts in real time.

A few days out, here's what I'm still thinking about.

Why we built our own RTMP server

The honest answer: no turnkey live-streaming vendor was going to let us run inference on the frames as they arrived. Most ingest providers will hand you a delayed mezzanine file after the broadcast ends. We needed pixels in our process, in flight, before they ever reached a viewer.

So we built it. Custom RTMP ingest, GPU-pinned decoder, a frame router that fans out copies to whichever model needs them. The latency budget from camera shutter to first inference output is currently under 400ms. That's the floor for everything that follows.
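For readers who want something concrete, here is a minimal sketch of what a fan-out frame router of this kind could look like. It is not the production code: the names `Frame`, `FrameRouter`, and `within_budget` are hypothetical, and the drop-oldest policy and queue depth are assumptions chosen for illustration. The one idea it shows is that a slow model should lose stale frames rather than stall ingest.

```python
import queue
from dataclasses import dataclass


@dataclass
class Frame:
    seq: int            # frame sequence number
    captured_at: float  # capture time in seconds (any monotonic clock)
    data: bytes = b""   # placeholder for decoded pixels


class FrameRouter:
    """Fans each frame out to one bounded queue per subscribed model.

    Assumption: single publisher thread. This sketch shares the same
    Frame object across queues; a real system might copy GPU buffers.
    """

    def __init__(self, max_pending: int = 2):
        self.max_pending = max_pending
        self.subscribers: dict[str, queue.Queue] = {}

    def subscribe(self, model_name: str) -> queue.Queue:
        q = queue.Queue(maxsize=self.max_pending)
        self.subscribers[model_name] = q
        return q

    def publish(self, frame: Frame) -> None:
        for q in self.subscribers.values():
            if q.full():
                # In live video a fresh frame beats a stale one, so drop
                # the oldest pending frame instead of blocking ingest.
                try:
                    q.get_nowait()
                except queue.Empty:
                    pass
            q.put_nowait(frame)


def within_budget(frame: Frame, now: float, budget_s: float = 0.400) -> bool:
    """True if the frame is still inside the latency budget (400ms here)."""
    return (now - frame.captured_at) <= budget_s
```

With `max_pending=2`, a model that falls behind only ever sees the two most recent frames, which keeps its output inside the budget at the cost of skipped frames.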

What we found about models

We tested every reasonable vision model we could get our hands on. Open source first, because the price was zero and the throughput would be ours. Then proprietary, because the leaderboards kept telling us they were better.

The benchmark that mattered most was made-shot accuracy. A made shot looks like a missed shot until the rim interaction, and the rim interaction in a 720p arena broadcast is six pixels wide. The difference between a model that gets this right 88% of the time and one that gets it right 96% of the time is, for our customers, the difference between a usable product and a toy.
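To make that gap concrete, here is a small piece of arithmetic as code. The figure of 100 shot attempts is a hypothetical round number, not a measurement from the event. Only the 88% and 96% accuracy levels come from the text above.

```python
def expected_wrong_calls(shot_attempts: int, accuracy: float) -> float:
    """Expected number of mislabeled shots at a given accuracy."""
    return shot_attempts * (1.0 - accuracy)


attempts = 100  # hypothetical number of shot attempts, for illustration

for acc in (0.88, 0.96):
    wrong = expected_wrong_calls(attempts, acc)
    print(f"{acc:.0%} accurate -> ~{wrong:.0f} wrong calls per {attempts} attempts")
```

Eight points of accuracy is a threefold drop in the error rate: roughly 12 wrong calls per 100 attempts versus roughly 4. Viewers notice the wrong calls, not the accuracy number.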

We expected open-source models with our own fine-tuning to win. We had the data. We had the team. We had the conviction that proprietary would be overkill.

We were wrong. One of the latest pro-tier models, used with no fine-tuning at all, hit our accuracy bar out of the box. Fine-tuning it on top of that took us into territory we hadn't gotten to with open source after months of work. The math on inference cost was 1.4x what we had budgeted. The math on customer perception when made-shot accuracy crossed 95% was worth several multiples of that.

We left the open-source benchmarks running. They are useful for what they are useful for. Tagging, summarization, post-game analysis where latency and per-frame accuracy aren't both pinned. For the inference that hits the live event pipeline, we are paying for the higher-tier model and not apologizing for it.

That's not a fact I expected to be writing six months ago. And I keep wondering which other "obvious" calls I'm currently getting backwards.

The victory moment

A play happens on the floor. The camera sees it. The RTMP server catches it. The model labels it. In the same second, the event pipeline decides which clip cuts into the highlight reel, which stat updates the leaderboard, which graphic the broadcast operator should consider, and whether the moment is worth pushing to the social queue.

Every one of those used to be a human decision made by a multi-person production crew. At the BIG3 combine, with five of us on the floor and the system running, it was a sequence of small decisions made by software in under a second each.
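A minimal sketch of how decisions like these could be expressed in code, assuming each model output arrives as a label with a confidence score. Everything here is hypothetical: the `Detection` type, the action names, and the thresholds are illustrative stand-ins, not the rules the real pipeline uses.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Detection:
    label: str         # e.g. "made_shot" or "foul"
    confidence: float  # model score in [0, 1]
    clip_id: str       # identifier of the surrounding video clip


# Hypothetical thresholds, chosen only for illustration.
HIGHLIGHT_MIN = 0.80
STAT_MIN = 0.90
SOCIAL_MIN = 0.95


def decide(d: Detection) -> list[str]:
    """Map one detection to the downstream actions it should trigger."""
    actions: list[str] = []
    if d.label == "made_shot":
        if d.confidence >= STAT_MIN:
            actions.append("update_leaderboard")
        if d.confidence >= HIGHLIGHT_MIN:
            actions.append("cut_highlight")
        if d.confidence >= SOCIAL_MIN:
            actions.append("queue_social")
    elif d.label == "foul" and d.confidence >= HIGHLIGHT_MIN:
        # A suggestion only: the broadcast operator still makes the call.
        actions.append("suggest_graphic")
    return actions
```

The design choice worth noting is that different actions get different bars. A wrong highlight is cheap to ignore, while a wrong stat on the leaderboard or a wrong post to social is not.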

The first time I watched the pipeline make a call a control room would have made, the floor felt quiet. Not because the team wasn't working; we were. The cadence of all the small connective decisions had just slid underneath us.

Is that what good infrastructure feels like, or is that what being slightly redundant feels like? I'm not sure I can tell from inside the moment.

Photo: our live AI infrastructure running courtside at the BIG3 combine. Five humans, one full live stack.

What this teaches about the state of vision models

The state of the art in vision is converging faster than the sports tech market is acknowledging.

A year ago, accuracy on this category of fine-grained sports event detection was an open research problem. Today the gap between a budget setup and a top-tier setup is measured in dollars per inference, not in whether the task is solvable at all.

What this means for builders: the moat isn't the model. Anyone with our budget and our test rigs can buy our accuracy level inside a quarter. The moat is the pipeline you've built around the model, the events you've taught it to care about, and the decisions you make on top of its outputs while the game is still being played.

What we're building

Live infrastructure for sports video that runs from camera shutter to operator screen in under a second. Open enough to swap models as the frontier moves. Closed enough that the trade secrets stay ours.

The bet isn't on a single model. It's on the stack underneath it.

I don't fully know yet whether that stack ends up looking like a CDN, a tagging tool, a coach assistant, or something I haven't named. If you're working anywhere in this neighborhood, I want to compare notes.
