
Overshoot
AI infrastructure for real-time vision applications.
<200ms
Inference latency
~10× faster than existing platforms
1,000+
Developers
Engaged at W26 launch
$58B
Computer vision TAM
By 2030 · Grand View Research
Thesis
- 01
Real-time vision is a whole-stack constraint, not a model toggle. Latency budgets are fragile and cumulative across transport, sampling, inference throughput, and output length. Production systems must explicitly manage these budgets. Overshoot's SDK surfaces the levers — sampling modes, output token caps, latency callbacks — and ties them to perceived latency.[4] [5]
- 02
VLM-first UX compresses iteration vs. bespoke CV. Traditional computer vision demanded labeling and training per use case. Vision-Language Models collapse most of that to a prompt with a schema. Overshoot's "prompt-as-program" interface aligns with how agentic, interactive applications are actually built today.[1] [2]
- 03
Hyperscalers validate the category but stop at model I/O. Gemini Live and OpenAI Realtime ship the model interface; developers still need transport orchestration, sampling policy, model routing, and reliability/observability to hit SLAs. Overshoot packages those missing layers — specifically for vision.[1] [2] [3]
- 04
The Hjouji brothers are the canonical team for this stack. Zakaria built GPU kernels at Meta AI and low-latency surge pricing systems at Uber. Younes was a founding engineer at Cosmonio (acquired by Intel) where he built a CV training and serving platform from scratch. Inference engines, low-latency systems, and applied computer vision — assembled in one cofounding pair.
Problem
AI can now see and understand the physical world. Building on top of it is still painful.
Vision-Language Models have unlocked real applications in physical security, safety, gaming, robotics, and consumer products. Soon, video agents will watch your home and your pet when you're away. The model side of that future is already here.
The infrastructure side is not. Developers building real-time vision applications face three compounding problems: slow inference, limited model availability, and infrastructure that breaks at scale.[13] Existing inference platforms were designed around text — they treat image and video as awkward attachments rather than first-class modalities, and they leave transport, sampling, and stream lifecycle entirely to the developer.
The result: every team building a video agent ends up re-inventing the same stack from WebRTC ingest down through sampling policy and output budgeting. The work isn't novel and it isn't differentiated. It's plumbing — and it's preventing the applications from ever shipping.
<200ms
Overshoot end-to-end
Live video → VLM response
10×
Latency improvement
Vs. existing inference platforms
3 lines
Of code
To connect live video to a VLM
Why Now
The model layer just got real. The infra layer hasn't caught up.
Three converging shifts make the whole-stack vision problem solvable for the first time — and create a discrete window before hyperscalers extend their APIs downward.
The base models work. The transport doesn't.
Hyperscalers standardized low-latency model I/O. Gemini Live[1] and OpenAI Realtime[2] shipped streaming multimodal interfaces in the last 12 months. The model boundary is now a solved problem with documented latency targets.
WebRTC is the de facto transport. WebRTC[4] and LiveKit[3] have hardened into the default real-time media stack. Battle-tested SFUs, agents frameworks,[14] and reconnect semantics now exist — but they're general-purpose, not vision-specific.
The layer in between is still missing. What hasn't been built is the piece that takes a live feed, samples it intelligently against a latency budget, routes it to the right VLM, enforces a schema on the output, and survives jitter. That's the gap Overshoot fills.[5] [13]
Image and video are fundamentally different modalities from text. By focusing on them, we can make strong technical leaps across the stack, from codecs and streaming protocols to inference engines.
How It Works
Three lines of code. Sub-200ms responses. Schema-checked output.
Latency is a first-class API primitive.
Sampled inference by design. Most real-time vision workloads are event-driven, not continuous. Overshoot exposes targetFps, clip length, clip delay, and interval_seconds as explicit parameters — so developers trade thoroughness for latency budget at the API surface rather than discovering the limits in production.[13]
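The trade-off between thoroughness and latency can be made concrete with a rough budget calculation. The sketch below is illustrative only: `SamplingConfig` loosely mirrors the parameters named above (clip delay is left out for brevity), but the field names, the `StageCosts` model, and every number are assumptions, not Overshoot's actual SDK or measured performance.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SamplingConfig:
    """Hypothetical mirror of the sampling parameters named in the text."""
    target_fps: float        # frames sampled per second of video
    clip_length_s: float     # seconds of video sent per request
    interval_s: float        # seconds between successive requests
    max_output_tokens: int   # cap on tokens generated per response


@dataclass(frozen=True)
class StageCosts:
    """Assumed per-stage costs in milliseconds; real values must be measured."""
    transport_ms: float
    prefill_ms_per_frame: float
    decode_ms_per_token: float


def frames_per_request(cfg: SamplingConfig) -> int:
    """Number of frames in one clip, never fewer than one."""
    return max(1, round(cfg.target_fps * cfg.clip_length_s))


def estimated_latency_ms(cfg: SamplingConfig, costs: StageCosts) -> float:
    """Sum the stages: the budget is cumulative, so every term counts."""
    return (
        costs.transport_ms
        + frames_per_request(cfg) * costs.prefill_ms_per_frame
        + cfg.max_output_tokens * costs.decode_ms_per_token
    )


def fits_budget(cfg: SamplingConfig, costs: StageCosts, budget_ms: float = 200.0) -> bool:
    """True if the estimated response time stays within the latency budget."""
    return estimated_latency_ms(cfg, costs) <= budget_ms


def keeps_up(cfg: SamplingConfig, costs: StageCosts) -> bool:
    """True if each request finishes before the next one is due."""
    return estimated_latency_ms(cfg, costs) <= cfg.interval_s * 1000.0


if __name__ == "__main__":
    # Made-up costs, chosen only to show how the terms add up.
    costs = StageCosts(transport_ms=40.0, prefill_ms_per_frame=25.0, decode_ms_per_token=4.0)
    light = SamplingConfig(target_fps=2.0, clip_length_s=1.0, interval_s=1.0, max_output_tokens=20)
    heavy = SamplingConfig(target_fps=6.0, clip_length_s=1.0, interval_s=1.0, max_output_tokens=20)
    for name, cfg in (("light", light), ("heavy", heavy)):
        print(name, estimated_latency_ms(cfg, costs), fits_budget(cfg, costs))
```

Under these invented costs, tripling the sampled frame rate is enough to push a request past a 200ms budget even though transport and output length are unchanged, which is the sense in which the budget is cumulative.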
Model surface and routing. Overshoot hosts the largest collection of Vision-Language Models behind a single API, with a "gemini" passthrough backend when direct model access is preferred.[1] Schema enforcement supports structured outputs for downstream systems — no parsing, no half-formed JSON, no retries.[13]
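For a sense of what schema enforcement removes from application code, the following is a minimal sketch of the check a downstream consumer would otherwise write by hand. The schema, field names, and `parse_structured` helper are hypothetical and use only the Python standard library; they do not represent Overshoot's API.

```python
import json
from typing import Any

# Hypothetical schema for a "is someone at the door?" style prompt.
SCHEMA: dict[str, type] = {
    "person_present": bool,
    "count": int,
    "description": str,
}


def parse_structured(raw: str, schema: dict[str, type]) -> dict[str, Any]:
    """Parse a model response and reject anything that does not match the schema.

    Raises ValueError for malformed JSON, missing or unexpected keys,
    and values of the wrong type.
    """
    data = json.loads(raw)  # json.JSONDecodeError is a subclass of ValueError
    if not isinstance(data, dict):
        raise ValueError("response must be a JSON object")

    missing = schema.keys() - data.keys()
    extra = data.keys() - schema.keys()
    if missing or extra:
        raise ValueError(f"missing={sorted(missing)} unexpected={sorted(extra)}")

    for key, expected in schema.items():
        value = data[key]
        # bool is a subclass of int in Python, so reject True where an int is expected.
        is_stray_bool = isinstance(value, bool) and expected is not bool
        if is_stray_bool or not isinstance(value, expected):
            raise ValueError(f"field {key!r} should be {expected.__name__}")
    return data


if __name__ == "__main__":
    ok = '{"person_present": true, "count": 1, "description": "courier at door"}'
    print(parse_structured(ok, SCHEMA))
```

When the platform guarantees the output shape, this validation and the retry loop that usually surrounds it no longer need to live in the application.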
Reliability primitives. Stream lifecycle and reconnect semantics, observability hooks, and latency-aware callbacks keep developers inside their latency budgets even on imperfect networks — the "last mile" work that vision teams otherwise own end-to-end.[5]
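As a rough illustration of a latency-aware callback, the sketch below tracks a rolling window of response times and invokes a handler when the window's 95th percentile exceeds the budget. The `LatencyMonitor` class, its parameters, and the nearest-rank percentile choice are assumptions made for this example, not a description of Overshoot's implementation.

```python
import math
from collections import deque
from typing import Callable, Deque


class LatencyMonitor:
    """Hypothetical rolling-window monitor that fires a callback when over budget."""

    def __init__(
        self,
        budget_ms: float,
        on_over_budget: Callable[[float], None],
        window: int = 20,
    ) -> None:
        if window < 1:
            raise ValueError("window must be at least 1")
        self._budget_ms = budget_ms
        self._on_over_budget = on_over_budget
        # deque with maxlen drops the oldest sample automatically.
        self._samples: Deque[float] = deque(maxlen=window)

    def p95(self) -> float:
        """Nearest-rank 95th percentile of the current window."""
        if not self._samples:
            raise ValueError("no samples recorded yet")
        ordered = sorted(self._samples)
        rank = math.ceil(0.95 * len(ordered))
        return ordered[max(0, rank - 1)]

    def record(self, latency_ms: float) -> None:
        """Add one observed response time and notify if the window is over budget."""
        self._samples.append(latency_ms)
        current = self.p95()
        if current > self._budget_ms:
            self._on_over_budget(current)


if __name__ == "__main__":
    monitor = LatencyMonitor(
        budget_ms=200.0,
        on_over_budget=lambda p95: print(f"over budget: p95={p95:.0f}ms"),
        window=10,
    )
    for sample in [120.0, 130.0, 110.0, 350.0]:
        monitor.record(sample)
```

An application might respond to the callback by lowering the sampled frame rate or the output token cap until the window recovers.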
Zero infra headache. Developers connect live video feeds to VLMs with 3 lines of code and get responses in less than 200ms — roughly 10× faster than any existing inference platform. The interface itself is a tell: it exposes the real production constraints, not the demo path.[13]
Market
Enterprises already spend tens of billions turning video into operational signals.
Video analytics software is on track to grow from ~$12.7B (2024) to ~$37.8B (2030).[6] Video surveillance (hardware, software, and services) moves from ~$73.8B to ~$147.7B over the same period.[7] Computer vision overall: ~$19.8B to ~$58.3B.[8]
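The growth rates implied by these figures can be checked with a short compound annual growth rate calculation. The sketch assumes a 2024 base year and a 2030 end year for all three markets; the text states those years only for video analytics and surveillance, so the computer vision result in particular rests on that assumption.

```python
def cagr(start_value: float, end_value: float, years: int) -> float:
    """Compound annual growth rate: (end / start) ** (1 / years) - 1."""
    if start_value <= 0 or years <= 0:
        raise ValueError("start_value and years must be positive")
    return (end_value / start_value) ** (1.0 / years) - 1.0


# Market sizes in billions of USD, as quoted in the text.
# Assumption: every series runs 2024 -> 2030 (six years).
MARKETS = {
    "video analytics": (12.7, 37.8),
    "video surveillance": (73.8, 147.7),
    "computer vision": (19.8, 58.3),
}

if __name__ == "__main__":
    for name, (start, end) in MARKETS.items():
        print(f"{name}: {cagr(start, end, years=2030 - 2024):.1%} per year")
```

Under that assumption, video analytics and computer vision both imply roughly 20% annual growth, while surveillance, which includes hardware, implies roughly 12%.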
These markets are already monetized. What's changing is how the applications get built. The previous generation required custom models and bespoke deployments per camera per use case. VLMs collapse that to a prompt — but only if the infrastructure underneath can handle live streams. Overshoot is the developer-infrastructure category that makes the next generation of these applications buildable in days instead of quarters.
The right pricing shape already exists.
Streaming workloads are event/sampling-driven, not continuous 24/7, even when the camera is always on. AWS's Rekognition Streaming Video Events architecture and per-minute pricing[9] are the existence proof: revenue scales with minutes analyzed, not wall-clock stream time. Overshoot's event-driven sampling design lines up directly with that billing shape, which means margin discipline is built into the product, not bolted on.
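To make the gap between minutes analyzed and wall-clock stream time concrete, here is a small sketch. The event count and clip length are invented for illustration, and no actual AWS or Overshoot price is assumed; the code only compares the two quantities.

```python
from typing import Sequence

MINUTES_PER_DAY = 24 * 60


def analyzed_minutes(clip_durations_s: Sequence[float]) -> float:
    """Total minutes of video actually sent for analysis."""
    return sum(clip_durations_s) / 60.0


def analyzed_fraction(
    clip_durations_s: Sequence[float],
    stream_minutes: float = MINUTES_PER_DAY,
) -> float:
    """Share of wall-clock stream time that is analyzed, and so billable."""
    if stream_minutes <= 0:
        raise ValueError("stream_minutes must be positive")
    return analyzed_minutes(clip_durations_s) / stream_minutes


if __name__ == "__main__":
    # Hypothetical day: an always-on camera with 40 motion events,
    # each producing one 15-second clip.
    clips = [15.0] * 40
    print(f"analyzed: {analyzed_minutes(clips):.1f} min of {MINUTES_PER_DAY} min")
    print(f"fraction analyzed: {analyzed_fraction(clips):.2%}")
```

In this hypothetical day the camera streams for 1,440 minutes but only 10 are analyzed, so under per-minute pricing both cost and revenue follow the 10.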
Initial ICPs: physical security and monitoring, QA and inspection, robotics and tele-operations, and interactive consumer products. The common thread: latency is a hard requirement, cameras or WebRTC sources already exist, and the application logic is "show the VLM what's happening, structure the response, act on it."[1] [2]
Competitive landscape
Six categories of competition. Overshoot is the only one purpose-built for live VLM inference.
Each adjacent category solves a real problem — but none of them solves Overshoot's. Transport without inference, batch without live, prebuilt detectors without prompting, model APIs without lifecycle.
Our moat is focus: image and video are fundamentally different modalities from text, and concentrating on them lets us make technical leaps across codecs, streaming protocols, and inference engines.
Traction
Developer pull at W26 launch.
1,000+ developers on the platform. ~10× faster than existing inference platforms.
Company materials cite 1,000+ developers engaged with the platform at YC W26 launch.[13] Public docs lead with the "3 lines to live video → VLM" demo and LiveKit room ingest.[3] Responses arrive in under 200ms — roughly 10× faster than existing inference platforms for comparable workloads.
Early adoption is concentrated in the ICPs the product is built for: video agents in gaming, robotics, and physical security. The interface itself is a tell — sampling knobs, output token budgets, latency callbacks. This is what an API designed by people who have run video inference at scale looks like, not what a demo path looks like.[13]
Founder deep dive
The exact two backgrounds you would assemble to build this.
Founders & team
Strategic advantages & gaps
Where the moat compounds — and where it has to keep being earned.
Advantages
Video-native architecture
Samples and clips are first-class objects, output tokens are explicitly budgeted, and transport is integrated for real-time SLAs.
Reliability & ergonomics
Stream lifecycle handling and latency-aware callbacks absorb the "last mile" work developers otherwise own end-to-end.
Model flexibility
Hosted VLMs plus Gemini-class passthrough preserve developer choice without losing the reliability and observability layer.
Gaps to earn
Continuous latency optimization
Keeping the 10× lead requires constant work on GPU packing, fairness, and head-of-line blocking. There is no resting state.
Hyperscaler overlap
As base APIs add video features, Overshoot has to stay clearly better on reliability, observability, and workflow shape — not just speed.
Enterprise deployment surfaces
Many security and inspection buyers require on-prem or edge. A hybrid story has to develop alongside the cloud-first SDK.
Risks & mitigations
What we're watching
References
- [1] Google. Gemini API: Live (Multimodal streaming)
- [2] OpenAI. Realtime API Guide
- [3] LiveKit. Docs (WebRTC real-time media platform)
- [4] WebRTC. Overview
- [5] Latent Space. The Realtime AI Playbook (latency budgets)
- [6] Grand View Research. Video Analytics Market
- [7] Grand View Research. Video Surveillance Market
- [8] Grand View Research. Computer Vision Market
- [9] AWS. Rekognition Streaming Video Events (architecture/pricing model)
- [10] Roboflow. Inference (deployment + streaming)
- [11] Twelve Labs. Product Overview
- [12] Coactive AI. Platform Overview
- [13] Overshoot. Website / Docs
- [14] LiveKit. Agents Framework


