Darkfield SEE
— open-vocabulary, type-aware perception for industrial CCTV.
SEE is an industrial vision model built for the angles, lighting, and vocabulary of real CCTV deployments. It is similar in spirit to SAM3-class models, but purpose-built to distinguish a spiral mixer from a planetary mixer, or a rotary filler from a linear filler.
What SEE does,
on every frame.
Describe what to find. No class list.
Prompted in plain English via THINK. "Spiral mixer" and "planetary mixer" in the same frame produce two distinct, correctly-labelled boxes — on the first frame, without labelled training data.
Subtypes, not just classes.
Fine-grained within a detection: lorry → tanker / flatbed / curtain-side; filler → rotary / linear / piston; oven → deck / tunnel. The subtype vocabulary is open — THINK extends it at onboarding.
Re-ID through occlusion and re-entry.
Persistent IDs survive objects leaving frame and returning. No third-party tracking dependency; the tracker module is trained jointly with the detection head.
Pixel-accurate masks; stable totals under crowding.
Mask outputs available alongside boxes. Counting remains stable under motion blur and partial occlusion — the conditions production lines actually produce.
Per-camera adapters, hot-swapped at validation.
THINK trains per-camera adapters on model-curated samples. New weights are validated on a held-out slice before they go live. The base model is never modified; adapters are swappable and auditable.
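The validate-then-swap rule described above can be sketched as a small gate. This is a minimal illustration, not SEE's actual mechanism: `Adapter`, `CameraSlot`, `try_promote`, the `evaluate` callable, and the `min_gain` margin are all hypothetical names introduced here, and the held-out metric is assumed to be a single float where higher is better.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional


@dataclass(frozen=True)
class Adapter:
    """Hypothetical per-camera adapter: an ID plus opaque weights."""
    adapter_id: str
    weights: bytes


class CameraSlot:
    """Holds the live adapter for one camera plus an audit trail of decisions.

    live=None means the camera runs on the unmodified base model.
    """

    def __init__(self, camera_id: str, live: Optional[Adapter] = None) -> None:
        self.camera_id = camera_id
        self.live = live
        self.audit_log: List[Dict[str, object]] = []

    def try_promote(
        self,
        candidate: Adapter,
        evaluate: Callable[[Optional[Adapter]], float],
        min_gain: float = 0.0,  # assumed margin; the source gives no threshold
    ) -> bool:
        """Swap in the candidate only if it beats the live adapter on held-out data."""
        live_score = evaluate(self.live)
        candidate_score = evaluate(candidate)
        accepted = candidate_score > live_score + min_gain
        # Every decision is recorded, accepted or not, so swaps stay auditable.
        self.audit_log.append({
            "camera_id": self.camera_id,
            "candidate": candidate.adapter_id,
            "live_score": live_score,
            "candidate_score": candidate_score,
            "accepted": accepted,
        })
        if accepted:
            self.live = candidate  # the base model itself is never touched
        return accepted
```

A rejected candidate leaves the live adapter in place, so a bad retrain cannot degrade a camera that is already performing well.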
Evaluated on
production CCTV conditions.
DarkfieldOps-300 is our public evaluation set: 300+ production-line objects across food manufacturing, forecourt, cold-chain, warehouse, and quarry settings. Every object is labelled with both class and subtype. Footage is real CCTV at the angles, resolutions, and lighting of actual deployments. Published under a research licence on research.html.
Headline comparison — DarkfieldOps-300
SEE leads on the type-aware score, the metric that captures the spiral-vs-planetary distinction, because it was trained on the subtype vocabulary that general-purpose models have never seen. On standard mAP the gap is small (0.91 vs 0.88 mAP@0.5 against the next-best model, YOLO-26); on subtype discrimination it is decisive (0.86 vs 0.41).
| model | mAP@0.5 | mAP@0.5:0.95 | ID-F1 (tracking) | type-aware score |
|---|---|---|---|---|
| Darkfield SEE | 0.91 | 0.74 | 0.88 | 0.86 |
| YOLO-26 | 0.88 | 0.71 | — | 0.41 |
| RT-DETR | 0.87 | 0.70 | — | 0.38 |
| SAM3 (zero-shot) | 0.79 | 0.61 | — | 0.29 |
| DETR-family (base) | 0.82 | 0.65 | — | 0.22 |
// type-aware score = macro-F1 over subtype labels conditioned on correct parent-class detection · — = tracker not evaluated for this model
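One way to read the type-aware score definition in the footnote above is sketched below. It assumes detections have already been matched to ground truth upstream (for example by IoU), which the source does not specify. The `Match` record and `type_aware_score` function are hypothetical names, and averaging over every subtype label that appears in either ground truth or predictions is an assumption about how the macro average is taken.

```python
from collections import Counter
from typing import Iterable, NamedTuple


class Match(NamedTuple):
    """One detection already matched to a ground-truth object (assumed upstream)."""
    gt_class: str
    pred_class: str
    gt_subtype: str
    pred_subtype: str


def type_aware_score(matches: Iterable[Match]) -> float:
    """Macro-F1 over subtype labels, counting only matches whose parent class is correct."""
    kept = [m for m in matches if m.gt_class == m.pred_class]
    tp, fp, fn = Counter(), Counter(), Counter()
    for m in kept:
        if m.gt_subtype == m.pred_subtype:
            tp[m.gt_subtype] += 1
        else:
            fp[m.pred_subtype] += 1  # predicted subtype gains a false positive
            fn[m.gt_subtype] += 1    # true subtype gains a false negative
    labels = set(tp) | set(fp) | set(fn)
    if not labels:
        return 0.0
    # Per-label F1 = 2TP / (2TP + FP + FN); the denominator is non-zero for every
    # label in the union, because each label appears in at least one counter.
    f1_per_label = [
        2 * tp[label] / (2 * tp[label] + fp[label] + fn[label]) for label in labels
    ]
    return sum(f1_per_label) / len(f1_per_label)
```

Because the parent-class filter runs first, a model that confuses a mixer with an oven is penalised in mAP but not in this score; the score isolates how well subtypes are separated once the parent class is right.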
After per-camera fine-tune — 48h and 7-day gain
Evolution's per-camera adapters produce measurable gains within 48 hours of deployment and continue improving through the first week. Numbers below are indicative across recent partner sites; individual results vary by scene complexity and lighting conditions.
| checkpoint | mAP@0.5 | mAP@0.5:0.95 | recall | type-aware score |
|---|---|---|---|---|
| baseline (onboarding) | 0.84 | 0.66 | 0.81 | 0.76 |
| after 48h retrain | 0.89 | 0.71 | 0.91 | 0.83 |
| after 7-day retrain | 0.94 | 0.78 | 0.96 | 0.90 |
Per-frame latency
Measured on the deployment hardware we recommend (NVIDIA T4 class). Sub-50ms at 1080p batch-1 is the design target. Total batch latency scales approximately linearly with batch size at small batches, so the per-frame figures below rise only slightly from batch 1 to batch 8.
| resolution | batch 1 | batch 4 | batch 8 |
|---|---|---|---|
| 720p | 28ms | 31ms / frame | 34ms / frame |
| 1080p | 44ms | 48ms / frame | 53ms / frame |
| 4K | 112ms | 119ms / frame | 131ms / frame |
// T4-class GPU · detection + tracking composite · with per-camera adapter loaded · 4K uses tile-and-merge
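The footnote says 4K frames use tile-and-merge. A generic version of that idea is sketched below: split the frame into overlapping tiles, detect per tile, shift boxes back into frame coordinates, and drop duplicates from the overlap regions with greedy IoU suppression. The tile size, overlap, IoU threshold, and the `detect_tile` callable are assumptions for illustration; the source does not describe SEE's actual tiling or merge rule.

```python
from typing import Callable, List, Tuple

Box = Tuple[float, float, float, float, float]  # x1, y1, x2, y2, score


def tile_origins(size: int, tile: int, overlap: int) -> List[int]:
    """Start offsets along one axis so that tiles cover [0, size)."""
    if overlap >= tile:
        raise ValueError("overlap must be smaller than the tile size")
    if tile >= size:
        return [0]
    origins = list(range(0, size - tile, tile - overlap))
    origins.append(size - tile)  # last tile sits flush with the far edge
    return origins


def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two boxes in the same coordinate frame."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def tile_and_merge(
    width: int,
    height: int,
    detect_tile: Callable[[int, int, int, int], List[Box]],  # hypothetical detector
    tile: Tuple[int, int] = (1920, 1080),  # assumed tile size
    overlap: int = 128,                    # assumed overlap in pixels
    iou_thresh: float = 0.5,               # assumed duplicate threshold
) -> List[Box]:
    """Run detect_tile(ox, oy, w, h) per tile and merge results in frame coordinates."""
    shifted: List[Box] = []
    for oy in tile_origins(height, tile[1], overlap):
        for ox in tile_origins(width, tile[0], overlap):
            for x1, y1, x2, y2, score in detect_tile(ox, oy, tile[0], tile[1]):
                # Detector output is tile-local; shift it into frame coordinates.
                shifted.append((x1 + ox, y1 + oy, x2 + ox, y2 + oy, score))
    kept: List[Box] = []
    for box in sorted(shifted, key=lambda b: b[4], reverse=True):
        # Greedy suppression: keep a box only if it does not duplicate a kept one.
        if all(iou(box, k) < iou_thresh for k in kept):
            kept.append(box)
    return kept
```

The overlap exists so that an object straddling a tile boundary is seen whole by at least one tile; the merge step then removes the second copy.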
Prompted in plain English.
Type-aware on the first frame.
Each row shows the prompt THINK issued and the result SEE produced. Prompts are drawn from real partner deployments. Images are from the DarkfieldOps-300 evaluation set.
"spiral mixer""planetary mixer""rotary filler""operator wearing high-vis""forklift carrying a pallet""any vehicle inside the staging zone""compressor that has stopped moving""any number plate visible""pallet stationary for more than two hours"// images from DarkfieldOps-300 · partner footage used with permission · identifiers anonymised
Architecture overview.
SEE is a vision transformer with a detection and segmentation head, a text prompt encoder, and a jointly-trained tracker module. The backbone is trained on industrial CCTV corpora from the ground up — not fine-tuned from a general-purpose vision model. The segmentation head follows a SAM-class decoder architecture adapted for the aspect ratios and scene density of overhead and angled CCTV views.
Per-camera adapters are attached at the last two backbone layers. THINK launches the fine-tune autonomously and hot-swaps the new weights once they pass validation, without stopping inference.
What SEE was trained on.
Pre-training data is sourced under partner agreements. Synthetic supplementation is used for rare subtypes and extreme lighting conditions. No consumer-PII frames are used at any stage. Verticals covered with confidence in the current release:
- Forecourt: vehicles, plates, pumps, canopy
- Food manufacturing: mixers, fillers, ovens, conveyors
- Cold store & cold chain: pallets, compressors, dock doors
- Warehouse yard: vehicles, forklifts, staging zones
- Quarry aggregates: heavy plant, conveyors, stockpiles
What SEE
won't do.
Objects smaller than roughly 16×16 pixels at inference resolution are detected unreliably. THINK mitigates this by recommending closer camera placement at onboarding.
Heavy rain, dense fog, and direct sun flare degrade detection quality. SEE reports a per-frame confidence signal that THINK uses to decide whether to alert or hold.
Zero-shot performance on verticals outside the training distribution is weaker. The 48-hour onboarding period with human verification is designed specifically to cover this gap.
Biometric identification is not implemented, by policy. SEE assigns persistent IDs to people without identifying them biometrically. This is a hard boundary, not a capability gap.
Audio is out of scope. SEE is a vision model, and microphone feeds are not processed. THINK's voice calls are synthesised output, not acoustic input.
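Two of the limits above, the roughly 16×16 pixel floor and the per-frame confidence signal, lend themselves to a short sketch of how a consumer such as THINK might gate alerts. The source gives only the 16-pixel figure; the `HOLD_BELOW` threshold, the `Detection` record, the rule that both sides must reach 16 pixels, and the `decide` function are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List

MIN_SIDE_PX = 16   # from the stated ~16x16 pixel reliability floor
HOLD_BELOW = 0.5   # assumed confidence threshold; not given in the source


@dataclass
class Detection:
    """Hypothetical detection summary: a label and its box size in pixels."""
    label: str
    width_px: float
    height_px: float


def is_reliable(det: Detection) -> bool:
    """Treat a detection as reliable only if both sides meet the size floor."""
    return det.width_px >= MIN_SIDE_PX and det.height_px >= MIN_SIDE_PX


def decide(
    frame_confidence: float,
    detections: List[Detection],
    hold_below: float = HOLD_BELOW,
) -> str:
    """Return "alert" or "hold" for one frame."""
    if frame_confidence < hold_below:
        return "hold"  # degraded frame (rain, fog, flare): do not alert on it
    if any(is_reliable(d) for d in detections):
        return "alert"
    return "hold"      # nothing large enough to trust
```

Holding instead of alerting on a degraded frame trades a short delay for fewer false alarms, which matches the alert-or-hold wording above.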
Model card.
| model | Darkfield SEE |
|---|---|
| role | Perceiver · data plane · per-camera detection, tracking, segmentation, OCR, counting |
| system | System 1 — runs on every frame at line rate |
| modality | Vision only — RTSP frames at 720p / 1080p / 4K |
| prompt format | Text prompt from THINK at pipeline composition time; not per-frame |
| output format | Bounding boxes · masks · track IDs · OCR strings · count totals · zone events |
| latency | <50ms per frame at 1080p, batch 1, on T4-class hardware |
| eval set | DarkfieldOps-300 — public, research licence — research.html |
| params | undisclosed in private beta |
| availability | Private beta · partner access only · edge deployment supported |
// citations and linked papers → research.html#papers
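For readers planning an integration, the output types listed in the model card (boxes, masks, track IDs, OCR strings, count totals, zone events) could be carried in a per-frame record along the following lines. This is a hypothetical schema, not SEE's published format: every class name, field name, and the polygon mask representation is an assumption.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import Dict, List, Optional, Tuple


@dataclass
class TrackedObject:
    """Hypothetical record for one detected object in one frame."""
    track_id: int                                  # persistent ID across frames
    label: str                                     # parent class, e.g. "mixer"
    subtype: Optional[str]                         # e.g. "spiral", if resolved
    box_xyxy: Tuple[float, float, float, float]    # bounding box in pixels
    mask_polygon: Optional[List[Tuple[float, float]]] = None  # assumed mask form
    ocr_text: Optional[str] = None                 # e.g. a number plate string


@dataclass
class FrameResult:
    """Hypothetical per-frame output for one camera."""
    camera_id: str
    frame_index: int
    objects: List[TrackedObject] = field(default_factory=list)
    zone_events: List[str] = field(default_factory=list)

    def count_totals(self) -> Dict[str, int]:
        """Count objects per parent class in this frame."""
        totals: Dict[str, int] = {}
        for obj in self.objects:
            totals[obj.label] = totals.get(obj.label, 0) + 1
        return totals

    def to_json(self) -> str:
        """Serialise the frame, including derived count totals, as JSON."""
        payload = asdict(self)
        payload["count_totals"] = self.count_totals()
        return json.dumps(payload)
```

Deriving count totals from the object list, instead of storing them separately, keeps the two from disagreeing within a frame.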
Run SEE against
your cameras.
We're onboarding a small number of partners in private beta.