system 1 · perceiver · vision

Darkfield SEE
— open-vocabulary, type-aware perception for industrial CCTV.

SEE is an industrial vision model built for the angles, lighting, and vocabulary of real CCTV deployments. It is similar in spirit to SAM3-class models, but purpose-built to distinguish a spiral mixer from a planetary mixer, or a rotary filler from a linear filler.


/ capabilities

What SEE does,
on every frame.

01 / open-vocabulary detection

Describe what to find. No class list.

Prompted in plain English via THINK. "Spiral mixer" and "planetary mixer" in the same frame produce two distinct, correctly-labelled boxes — on the first frame, without labelled training data.

02 / type-aware recognition

Subtypes, not just classes.

Fine-grained within a detection: lorry → tanker / flatbed / curtain-side; filler → rotary / linear / piston; oven → deck / tunnel. The subtype vocabulary is open — THINK extends it at onboarding.

03 / built-in tracker

Re-ID through occlusion and re-entry.

Persistent IDs survive objects leaving frame and returning. No third-party tracking dependency; the tracker module is trained jointly with the detection head.

04 / segmentation & counting

Pixel-accurate masks; stable totals under crowding.

Mask outputs available alongside boxes. Counting remains stable under motion blur and partial occlusion — the conditions production lines actually produce.

05 / continuously finetuned

Per-camera adapters, hot-swapped after validation.

THINK trains per-camera adapters on model-curated samples. New weights are validated on a held-out slice before they go live. The base model is never modified; adapters are swappable and auditable.
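The validate-then-swap step described above can be pictured with a minimal sketch. Everything here is hypothetical: the `Adapter` and `AdapterSlot` names, the `val_score` field, and in particular the acceptance rule (a candidate must match or beat the live adapter on the held-out slice) are assumptions for illustration, not the documented behaviour of THINK or SEE.

```python
from dataclasses import dataclass


@dataclass
class Adapter:
    """Hypothetical record for one set of per-camera adapter weights."""
    camera_id: str
    version: int
    val_score: float  # score on the held-out validation slice (assumed metric)


class AdapterSlot:
    """Holds the live adapter for one camera. The base model is never touched."""

    def __init__(self, camera_id):
        self.camera_id = camera_id
        self.live = None
        self.history = []  # audit trail of every proposal and its outcome

    def propose(self, candidate, min_gain=0.0):
        """Swap in the candidate only if it passes the (assumed) validation gate."""
        current = self.live.val_score if self.live else float("-inf")
        accepted = candidate.val_score >= current + min_gain
        self.history.append((candidate.version, candidate.val_score, accepted))
        if accepted:
            self.live = candidate  # swap the reference; inference keeps running
        return accepted


slot = AdapterSlot("cam-01")
slot.propose(Adapter("cam-01", 1, 0.84))  # first adapter goes live
slot.propose(Adapter("cam-01", 2, 0.80))  # regression: rejected, version 1 stays live
```

Keeping every proposal in `history`, accepted or not, is one simple way to make adapters auditable in the sense the text describes.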

/ benchmarks · DarkfieldOps-300

Evaluated on
production CCTV conditions.

DarkfieldOps-300 is our public evaluation set: 300+ production-line objects across food manufacturing, forecourt, cold-chain, warehouse, and quarry settings. Every object is labelled with both class and subtype. Footage is real CCTV at the angles, resolutions, and lighting of actual deployments. Published under a research licence on research.html.

Headline comparison — DarkfieldOps-300

SEE leads on the type-aware score — the metric that captures the spiral-vs-planetary differentiator — because it was trained on the subtype vocabulary that general-purpose models have never seen. On standard mAP, the gap is smaller; on subtype discrimination, it is decisive.

model	mAP@0.5	mAP@0.5:0.95	ID-F1 (tracking)	type-aware score
Darkfield SEE	0.91	0.74	0.88	0.86
YOLO-26	0.88	0.71	—	0.41
RT-DETR	0.87	0.70	—	0.38
SAM3 (zero-shot)	0.79	0.61	—	0.29
DETR-family (base)	0.82	0.65	—	0.22

// type-aware score = macro-F1 over subtype labels conditioned on correct parent-class detection · — = tracker not evaluated for this model
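To make the footnote's definition concrete, here is a minimal sketch of a macro-F1 over subtype labels, restricted to detections whose parent class was predicted correctly. The input format (tuples of true parent, predicted parent, true subtype, predicted subtype for detections already matched to ground truth) and the choice to average over every subtype seen in either ground truth or predictions are assumptions; the published evaluation code may differ.

```python
from collections import defaultdict


def type_aware_score(matches):
    """Macro-F1 over subtypes, conditioned on a correct parent-class detection.

    matches: iterable of (true_parent, pred_parent, true_subtype, pred_subtype)
    for detections already matched to ground truth (matching rule not shown).
    """
    # Keep only detections where the parent class is correct.
    kept = [(ts, ps) for tp_, pp, ts, ps in matches if tp_ == pp]

    tp = defaultdict(int)
    fp = defaultdict(int)
    fn = defaultdict(int)
    for true_sub, pred_sub in kept:
        if true_sub == pred_sub:
            tp[true_sub] += 1
        else:
            fn[true_sub] += 1  # missed the true subtype
            fp[pred_sub] += 1  # wrongly predicted another subtype

    labels = set(tp) | set(fp) | set(fn)
    if not labels:
        return 0.0

    # Per-subtype F1 = 2TP / (2TP + FP + FN), then an unweighted mean.
    f1s = [2 * tp[l] / (2 * tp[l] + fp[l] + fn[l]) for l in labels]
    return sum(f1s) / len(f1s)


example = [
    ("mixer", "mixer", "spiral", "spiral"),
    ("mixer", "mixer", "spiral", "planetary"),
    ("mixer", "mixer", "planetary", "planetary"),
    ("filler", "mixer", "rotary", "spiral"),  # wrong parent class: excluded
]
score = type_aware_score(example)  # both subtypes score F1 = 2/3
```

Because the mean is unweighted, a rare subtype counts as much as a common one, which is why this kind of score separates models more sharply than mAP does.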

After per-camera fine-tune — 48h and 7-day gain

The per-camera adapters that THINK trains produce measurable gains within 48 hours of deployment and continue improving through the first week. Numbers below are indicative across recent partner sites; individual results vary by scene complexity and lighting conditions.

checkpoint	mAP@0.5	mAP@0.5:0.95	recall	type-aware score
baseline (onboarding)	0.84	0.66	0.81	0.76
after 48h retrain	0.89	0.71	0.91	0.83
after 7-day retrain	0.94	0.78	0.96	0.90

Per-frame latency

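Measured on the deployment hardware we recommend (NVIDIA T4 class). The design target is sub-50ms per frame at 1080p, batch 1. At small batch sizes, total batch latency scales approximately linearly with batch size, so per-frame latency rises only slightly: from 44ms at batch 1 to 53ms at batch 8 at 1080p.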

resolution	batch 1	batch 4	batch 8
720p	28ms	31ms / frame	34ms / frame
1080p	44ms	48ms / frame	53ms / frame
4K	112ms	119ms / frame	131ms / frame

// T4-class GPU · detection + tracking composite · with per-camera adapter loaded · 4K uses tile-and-merge
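As a rough aid to reading the table, the sketch below converts the per-frame figures into total batch latency and frames per second. It assumes that "ms / frame" means the wall time of one batch divided by the batch size; the source does not state this explicitly, and the function names are illustrative only.

```python
# Per-frame latency in ms, copied from the table above: resolution -> {batch: ms}
LATENCY_MS = {
    "720p": {1: 28, 4: 31, 8: 34},
    "1080p": {1: 44, 4: 48, 8: 53},
    "4K": {1: 112, 4: 119, 8: 131},
}


def batch_latency_ms(resolution, batch):
    """Total wall time for one batch, assuming ms/frame = batch time / batch size."""
    return LATENCY_MS[resolution][batch] * batch


def throughput_fps(resolution, batch):
    """Frames processed per second at the given per-frame latency."""
    return 1000.0 / LATENCY_MS[resolution][batch]


total_1080p_b8 = batch_latency_ms("1080p", 8)  # 53 ms x 8 frames = 424 ms
fps_1080p_b1 = throughput_fps("1080p", 1)      # 1000 / 44, about 22.7 fps
```

Under this reading, a frame in a batch of 8 at 1080p waits for the whole 424ms batch, which matters when sizing alerting latency rather than throughput.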

/ open-vocabulary detection

Prompted in plain English.
Type-aware on the first frame.

Each row shows the prompt THINK issued and the result SEE produced. Prompts are drawn from real partner deployments. Images are from the DarkfieldOps-300 evaluation set.

"spiral mixer"

Spiral mixer highlighted correctly. The planetary mixer in the same bakery line frame is not selected — correct subtype discrimination on first pass, zero-shot.

"planetary mixer"

Inverse of above — same frame, planetary mixer highlighted, spiral mixer not selected. Demonstrates the subtype boundary is stable across both classes simultaneously.

"rotary filler"

Rotary filler on the bottling line highlighted. The linear filler upstream on the same conveyor is not selected. Relevant because both machines are physically similar at low resolution.

"operator wearing high-vis"

Factory line frame — two operators in high-vis tabards highlighted with persistent IDs (OP·0117, OP·0231). A visitor in plain clothes on the same frame is correctly excluded.

"forklift carrying a pallet"

Relational prompt — only the loaded forklift is masked. An empty forklift visible in the same dock frame is correctly excluded. Demonstrates compositional understanding of object state.

"any vehicle inside the staging zone"

Spatial prompt — only the vehicle inside the defined polygon is highlighted. Vehicles parked outside the zone in the same yard frame are not selected.

"compressor that has stopped moving"

Temporal prompt — the idle compressor is highlighted; the running one in the adjacent bay is not. State is inferred from motion history, not a separate classification head.

"any number plate visible"

All visible plates in the forecourt frame highlighted, each with an OCR readout overlaid. Handles partial occlusion and oblique angles — conditions typical of entry/exit cameras.

"pallet stationary for more than two hours"

Composite spatial + temporal prompt. Pallets exceeding the dwell threshold are highlighted with a timer overlay. Integrates tracking history; requires no additional rule engine.

// images from DarkfieldOps-300 · partner footage used with permission · identifiers anonymised
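The spatial and temporal prompts above rest on two simple ideas: testing whether a tracked object sits inside a zone polygon, and measuring how long it has stayed put. The sketch below illustrates both with a standard ray-casting test and a dwell timer over track history. It is not SEE's implementation; the track format, the `still_px` tolerance, and the function names are assumptions.

```python
def point_in_polygon(x, y, polygon):
    """Ray-casting test. polygon is a list of (x, y) vertices in order."""
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Does this edge straddle the horizontal line through y?
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside  # each crossing to the right flips the state
    return inside


def dwell_seconds(track, still_px=2.0):
    """How long the object has stayed near its latest position.

    track: list of (t_seconds, x, y) sorted by time (hypothetical format).
    Walks backwards until the object is more than still_px away on either axis.
    """
    if not track:
        return 0.0
    t_last, x_last, y_last = track[-1]
    start = t_last
    for t, x, y in reversed(track):
        if abs(x - x_last) > still_px or abs(y - y_last) > still_px:
            break
        start = t
    return t_last - start


zone = [(0, 0), (10, 0), (10, 10), (0, 10)]
pallet_track = [(0, 100, 100), (3600, 5, 5), (7200, 5, 6), (11000, 5, 5)]

in_zone = point_in_polygon(5, 5, zone)               # True
over_two_hours = dwell_seconds(pallet_track) > 7200  # 7400 s of dwell, so True
```

A prompt such as "pallet stationary for more than two hours" then reduces to combining the two checks on each persistent track.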

/ architecture

Architecture overview.

SEE is a vision transformer with detection and segmentation heads, a text prompt encoder, and a jointly-trained tracker module. The backbone is trained on industrial CCTV corpora from the ground up, not fine-tuned from a general-purpose vision model. The segmentation head follows a SAM-class decoder architecture adapted for the aspect ratios and scene density of overhead and angled CCTV views.

Per-camera adapters are attached at the last two backbone layers. THINK launches the fine-tune autonomously and, once the new weights pass validation, hot-swaps them without stopping inference.

/ training data

What SEE was trained on.

Pre-training data is sourced under partner agreements. Synthetic supplementation is used for rare subtypes and extreme lighting conditions. No consumer-PII frames are used at any stage. Verticals covered with confidence in the current release:

→ Forecourt — vehicles, plates, pumps, canopy
→ Food manufacturing — mixers, fillers, ovens, conveyors
→ Cold store & cold chain — pallets, compressors, dock doors
→ Warehouse yard — vehicles, forklifts, staging zones
→ Quarry aggregates — heavy plant, conveyors, stockpiles

/ limitations

What SEE
won't do.

small & far objects

Objects below roughly 16×16 pixels at inference resolution are unreliable. Mitigated by THINK recommending closer camera placement at onboarding.

extreme weather

Heavy rain, dense fog, and direct sun flare degrade detection quality. SEE reports a per-frame confidence signal that THINK uses to decide whether to alert or hold.

novel verticals before per-camera fine-tune

Zero-shot performance on verticals outside the training distribution is weaker. The 48-hour onboarding period with human verification is designed specifically to cover this gap.

face recognition

Not implemented, by policy. SEE assigns persistent IDs to people without identifying them biometrically. This is a hard boundary, not a capability gap.

audio

Out of scope. SEE is a vision model. THINK's voice calls are synthesised output, not acoustic input — microphone feeds are not processed.

/ specification

Model card.

model	Darkfield SEE
role	Perceiver · data plane · per-camera detection, tracking, segmentation, OCR, counting
system	System 1 — runs on every frame at line rate
modality	Vision only — RTSP frames at 720p / 1080p / 4K
prompt format	Text prompt from THINK at pipeline composition time; not per-frame
output format	Bounding boxes · masks · track IDs · OCR strings · count totals · zone events
latency	<50ms per frame at 1080p (batch 1) on T4-class hardware
eval set	DarkfieldOps-300 — public, research licence — research.html
params	undisclosed in private beta
availability	Private beta · partner access only · edge deployment supported

// citations and linked papers → research.html#papers
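For readers planning an integration, the output format row of the model card could map onto a per-frame structure along these lines. This is only a sketch: the real schema is not published here, so every class and field name below is hypothetical, and the run-length-encoded mask field is one possible encoding among several.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Detection:
    """One detected object in a frame (hypothetical field names)."""
    label: str                      # parent class, e.g. "mixer"
    subtype: Optional[str]          # e.g. "spiral"; None if not resolved
    box: tuple                      # (x1, y1, x2, y2) in pixels
    track_id: Optional[str] = None  # persistent ID from the tracker
    ocr_text: Optional[str] = None  # e.g. a number plate readout
    mask_rle: Optional[str] = None  # segmentation mask, encoding assumed


@dataclass
class FrameResult:
    """Everything reported for one frame from one camera (hypothetical)."""
    camera_id: str
    timestamp: float
    detections: list = field(default_factory=list)
    zone_events: list = field(default_factory=list)

    def counts(self):
        """Count totals per parent class for this frame."""
        totals = {}
        for det in self.detections:
            totals[det.label] = totals.get(det.label, 0) + 1
        return totals


frame = FrameResult(
    camera_id="cam-01",
    timestamp=0.0,
    detections=[
        Detection("mixer", "spiral", (10, 10, 200, 220), track_id="MX-0001"),
        Detection("mixer", "planetary", (300, 15, 480, 230), track_id="MX-0002"),
        Detection("operator", None, (500, 40, 560, 200), track_id="OP-0117"),
    ],
)
frame_counts = frame.counts()
```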

Run SEE against
your cameras.

We're onboarding a small number of partners in private beta.

