Eight things every CCTV deployment needs,
in one AI.
Detection, tracking, segmentation, counting, OCR, heatmaps, classification, zones — and you ask for them in plain English. No class lists to pick from. No labels to draw.
Eight building blocks.
One model runs them all.
Open-vocabulary, type-aware.
Prompted in plain language — no class list required. SEE distinguishes a spiral mixer from a planetary mixer in the same frame on the first pass. Adding a new object class is a sentence, not a sprint.
Persistent identity, role inference, no biometrics.
Operators and visitors are distinguished by role cue (high-vis vs. plain clothes) and assigned persistent IDs from appearance and trajectory. No biometrics by default; face recognition is opt-in per deployment, under UK-GDPR.
Re-ID through occlusion and frame re-entry.
Built-in tracker trained jointly with the detection head. A forklift's path is drawn as a stable polyline across a dock, ID intact through an occlusion event. No third-party tracking library required.
Pixel-accurate masks, not just boxes.
Masks available for every detected object. Useful for exact pallet area calculations, contamination detection, and fill-level estimation. SAM-class decoder adapted for CCTV aspect ratios and scene density.
Stable totals under crowding and motion blur.
Count chip overlaid per frame. Robust against the conditions industrial scenes actually produce: tight stacking, fast movement, and partial occlusion at the edges of the frame. Shift totals are accumulated automatically.
Calibrated for CCTV angles and lighting.
Number plate recognition, label reading, and asset ID capture. Handles oblique angles, partial shadow, and the motion blur typical of entry/exit cameras. OCR output is attached to the detection event row.
Fine-grained subtype within a detection.
Lorry → tanker / flatbed / curtain-side. Filler → rotary / linear / piston. The subtype vocabulary is open — the AI extends it at onboarding with a plain-language description. No additional labelled data required for the first pass.
Polygons in real-world coordinates; entry, dwell, violation.
Zones are defined in plain language — "the staging area near dock B" — and grounded to pixel coordinates by the AI. Entry events, dwell timers, and violation flags are emitted as structured rows to Dashboards.
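As a rough illustration of what happens once a zone has been grounded to pixel coordinates, the sketch below runs a ray-casting point-in-polygon test over one object's track and emits entry, dwell, and violation rows. It is a minimal sketch, not the product's implementation: the function names, the row fields (`event`, `t`, `dwell_s`), the `max_dwell_s` threshold, and the choice to report dwell when the object leaves the zone are all assumptions made for illustration.

```python
Point = tuple[float, float]

def point_in_polygon(p: Point, polygon: list[Point]) -> bool:
    """Ray casting: count how many polygon edges a horizontal ray from p crosses."""
    x, y = p
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Only edges that straddle the horizontal line through p can be crossed.
        if (y1 > y) != (y2 > y):
            # x-coordinate where the edge meets that horizontal line.
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside

def zone_events(track, polygon, max_dwell_s):
    """track: list of (timestamp_s, (x, y)) for one persistent ID (hypothetical shape)."""
    events = []
    entered_at = None
    violated = False
    for t, pos in track:
        if point_in_polygon(pos, polygon):
            if entered_at is None:
                entered_at = t
                events.append({"event": "entry", "t": t})
            elif not violated and t - entered_at > max_dwell_s:
                # Flag once per visit when the dwell timer exceeds the limit.
                violated = True
                events.append({"event": "violation", "t": t, "dwell_s": t - entered_at})
        elif entered_at is not None:
            # Object has left the zone: report total dwell and reset the timer.
            events.append({"event": "dwell", "t": t, "dwell_s": t - entered_at})
            entered_at = None
            violated = False
    return events

# Hypothetical zone (a 10x10 pixel square) and a track that enters, lingers, and leaves.
staging_area = [(0, 0), (10, 0), (10, 10), (0, 10)]
track = [(0.0, (-5, 5)), (1.0, (5, 5)), (2.0, (5, 5)), (4.5, (5, 5)), (5.0, (15, 5))]
rows = zone_events(track, staging_area, max_dwell_s=3.0)
```

Each row is a plain dictionary, so it could be written to whatever event store a deployment uses; the test itself is cheap enough to run on the CPU for every tracked object on every frame.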
Capabilities combined
into higher-order outputs.
Most operational questions are answered not by a single capability, but by a chain. The AI builds these automatically from your prompt.
common compositions
// compositions are authored by the AI at onboarding time — not configured by a user in a UI tree.
Per-frame budget
on the Linox AI vision-box.
| capability | latency @1080p | notes |
|---|---|---|
| detect objects | 38ms | base detection head · open-vocabulary |
| detect people | 38ms | shared backbone with object detection · parallel |
| track | +4ms | incremental; re-ID pass only on new objects |
| segment | +6ms | mask head; runs on detected boxes only |
| count | <1ms | aggregation over detection outputs · negligible |
| OCR | +8ms | crop-and-pass on detected text regions only |
| classify (subtype) | +3ms | classification head on existing crop |
| zones | <1ms | polygon intersection test · CPU-side |
| full composition (detect + track + OCR + zones) | 44ms | typical production pipeline · within 50ms budget |
// measured on the Linox AI vision-box that ships with every deployment · FP16 · 1080p · per-camera adapter loaded · batch-1 single-stream baseline
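To put the table's figures in frame-rate terms, the arithmetic below converts a per-frame latency into the highest frame rate a single stream could sustain, and into headroom against the 50ms budget. It assumes frames are processed strictly one after another (matching the batch-1 single-stream baseline in the footnote); the function names are illustrative only.

```python
FRAME_BUDGET_MS = 50.0  # per-frame budget quoted in the table
PIPELINE_MS = 44.0      # measured full composition (detect + track + OCR + zones)

def max_fps(latency_ms: float) -> float:
    """Highest sustainable frame rate if each frame takes latency_ms end to end."""
    return 1000.0 / latency_ms

def headroom_ms(latency_ms: float, budget_ms: float = FRAME_BUDGET_MS) -> float:
    """Time left in the budget after the pipeline has run."""
    return budget_ms - latency_ms

print(f"{max_fps(PIPELINE_MS):.1f} fps at {PIPELINE_MS:.0f}ms per frame")
print(f"{headroom_ms(PIPELINE_MS):.0f}ms headroom against a {FRAME_BUDGET_MS:.0f}ms budget")
```

At 44ms per frame this works out to roughly 22.7 frames per second, with 6ms to spare before the 50ms budget (20 frames per second) is reached.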
What detection
won't do.
We'd rather be clear about the edges of the capability now than have you discover them in production.
Face recognition: off by default. Detection assigns persistent IDs from appearance and trajectory, with no biometrics. Face recognition is available as an opt-in per deployment under a UK-GDPR-compliant agreement (for example, for authorised-access verification or persistent identity across cameras), but it is never the default.
Gesture recognition: limited. Coarse gestures (raised hand, pointing) are possible with a per-camera fine-tune; fine-grained sign language or nuanced interaction is outside the current capability.
Audio: out of scope. SEE is a vision model. Microphone feeds are not processed at any stage.
Small objects: detections below ~16×16 pixels at inference resolution are unreliable. The AI flags this at stream inspection time and recommends a closer camera if the task requires it.
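The small-object limit can be checked ahead of time with simple arithmetic. The sketch below scales an object's size from the camera's native resolution to the inference resolution and compares it with the ~16×16 pixel threshold. It is a sketch under stated assumptions: the helper names are hypothetical, and treating an object as too small when either side falls below 16 pixels is one possible reading of the threshold.

```python
MIN_SIDE_PX = 16  # approximate threshold from the text

def size_at_inference(box_w, box_h, source_size, inference_size):
    """Scale a box measured in source pixels to inference-resolution pixels."""
    scale_x = inference_size[0] / source_size[0]
    scale_y = inference_size[1] / source_size[1]
    return box_w * scale_x, box_h * scale_y

def is_too_small(box_w, box_h, min_side=MIN_SIDE_PX):
    """Assumption: an object is unreliable if either side is under the threshold."""
    return box_w < min_side or box_h < min_side

# A 4K camera downscaled to 1080p halves every dimension.
source, inference = (3840, 2160), (1920, 1080)
ok_w, ok_h = size_at_inference(40, 40, source, inference)      # 20 x 20 at inference
small_w, small_h = size_at_inference(24, 24, source, inference)  # 12 x 12 at inference
```

An object that looks comfortably large in the native feed can therefore fall under the limit once the stream is downscaled, which is the case where a closer camera would be recommended.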
Eight capabilities,
running against your streams.
We're onboarding a small number of partners in private beta.