system 1 · perceiver · vision

Darkfield SEE
— open-vocabulary, type-aware perception for industrial CCTV.

SEE is an industrial vision model built for the angles, lighting, and vocabulary of real CCTV deployments. It is similar in spirit to SAM3-class models, but purpose-built to distinguish a spiral mixer from a planetary mixer, or a rotary filler from a linear filler.

/ capabilities

What SEE does,
on every frame.

01 / open-vocabulary detection

Describe what to find. No class list.

Prompted in plain English via THINK. "Spiral mixer" and "planetary mixer" in the same frame produce two distinct, correctly-labelled boxes — on the first frame, without labelled training data.

02 / type-aware recognition

Subtypes, not just classes.

Fine-grained within a detection: lorry → tanker / flatbed / curtain-side; filler → rotary / linear / piston; oven → deck / tunnel. The subtype vocabulary is open — THINK extends it at onboarding.
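
The parent-class to subtype relationship above can be pictured as a small mapping that grows at onboarding. The following is a minimal sketch under stated assumptions: `SUBTYPES` and `extend_vocabulary` are hypothetical names, and a plain dictionary is used only for illustration, since SEE's actual vocabulary format is not described here. The entries themselves come from the examples in the text.

```python
# Hypothetical illustration only: SEE's real vocabulary format is not published.
from typing import Dict, List

# Parent class -> known subtypes, using the examples from the text above.
SUBTYPES: Dict[str, List[str]] = {
    "lorry": ["tanker", "flatbed", "curtain-side"],
    "filler": ["rotary", "linear", "piston"],
    "oven": ["deck", "tunnel"],
}


def extend_vocabulary(vocab: Dict[str, List[str]], parent: str,
                      new_subtypes: List[str]) -> Dict[str, List[str]]:
    """Return a new vocabulary with extra subtypes added under `parent`.

    The input mapping is left untouched; duplicates are ignored and the
    existing order of subtypes is preserved.
    """
    updated = {cls: list(subs) for cls, subs in vocab.items()}
    existing = updated.setdefault(parent, [])
    for sub in new_subtypes:
        if sub not in existing:
            existing.append(sub)
    return updated


# Onboarding adds a parent class that was not in the starting vocabulary.
onboarded = extend_vocabulary(SUBTYPES, "mixer", ["spiral", "planetary"])
print(onboarded["mixer"])
```

Returning a copy rather than mutating in place keeps the starting vocabulary available for comparison, which fits the auditable spirit of the rest of the system, though that design choice is an assumption of this sketch.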

03 / built-in tracker

Re-ID through occlusion and re-entry.

Persistent IDs survive objects leaving frame and returning. No third-party tracking dependency; the tracker module is trained jointly with the detection head.

04 / segmentation & counting

Pixel-accurate masks; stable totals under crowding.

Mask outputs available alongside boxes. Counting remains stable under motion blur and partial occlusion — the conditions production lines actually produce.

05 / continuously finetuned

Per-camera adapters, hot-swapped at validation.

THINK trains per-camera adapters on model-curated samples. New weights are validated on a held-out slice before they go live. The base model is never modified; adapters are swappable and auditable.
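
The validate-before-live step can be sketched as a simple promotion gate. This is an illustration under assumptions, not THINK's actual logic: the `Adapter` class, the `promote_if_valid` function, the choice of a single held-out score, and the example scores are all hypothetical.

```python
# Hypothetical sketch: class, field, and function names are illustrative only.
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Adapter:
    camera_id: str
    version: int
    heldout_score: float  # metric on the held-out slice (assumed to be one number)


def promote_if_valid(live: Adapter, candidate: Adapter,
                     audit_log: List[dict], min_gain: float = 0.0) -> Adapter:
    """Return the adapter that should serve traffic after validation.

    The candidate replaces the live adapter only if its held-out score is at
    least `live.heldout_score + min_gain`. Every decision is logged.
    """
    if live.camera_id != candidate.camera_id:
        raise ValueError("adapters belong to different cameras")
    accepted = candidate.heldout_score >= live.heldout_score + min_gain
    audit_log.append({
        "camera_id": live.camera_id,
        "live_version": live.version,
        "candidate_version": candidate.version,
        "accepted": accepted,
    })
    return candidate if accepted else live


log: List[dict] = []
live = Adapter("cam-01", version=3, heldout_score=0.84)
live = promote_if_valid(live, Adapter("cam-01", 4, 0.89), log)  # improves: accepted
live = promote_if_valid(live, Adapter("cam-01", 5, 0.80), log)  # regresses: rejected
print(live.version, [entry["accepted"] for entry in log])
```

Because adapters are immutable records and the base model never appears in the function, a rejected candidate leaves the serving state exactly as it was.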

/ benchmarks · DarkfieldOps-300

Evaluated on
production CCTV conditions.

DarkfieldOps-300 is our public evaluation set: 300+ production-line objects across food manufacturing, forecourt, cold-chain, warehouse, and quarry settings. Every object is labelled with both class and subtype. Footage is real CCTV at the angles, resolutions, and lighting of actual deployments. Published under a research licence on research.html.

Headline comparison — DarkfieldOps-300

SEE leads on the type-aware score — the metric that captures the spiral-vs-planetary differentiator — because it was trained on the subtype vocabulary that general-purpose models have never seen. On standard mAP, the gap is smaller; on subtype discrimination, it is decisive.

model               mAP@0.5  mAP@0.5:0.95  ID-F1 (tracking)  type-aware score
Darkfield SEE       0.91     0.74          0.88              0.86
YOLO-26             0.88     0.71          —                 0.41
RT-DETR             0.87     0.70          —                 0.38
SAM3 (zero-shot)    0.79     0.61          —                 0.29
DETR-family (base)  0.82     0.65          —                 0.22

// type-aware score = macro-F1 over subtype labels conditioned on correct parent-class detection · — = tracker not evaluated for this model
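
The type-aware score definition in the footnote can be made concrete with a short calculation. This is a sketch, not the official DarkfieldOps-300 scorer. It assumes detections have already been matched to ground truth upstream, that each record is a hypothetical `(true_parent, pred_parent, true_subtype, pred_subtype)` tuple, and that the macro average runs over the subtypes present in the ground truth of the retained records.

```python
# Sketch of "macro-F1 over subtype labels conditioned on correct parent-class
# detection". Record layout and averaging set are assumptions of this example.
from collections import Counter
from typing import Iterable, Tuple

Record = Tuple[str, str, str, str]  # true_parent, pred_parent, true_sub, pred_sub


def type_aware_score(records: Iterable[Record]) -> float:
    # Condition on correct parent class: drop records where the parent is wrong.
    kept = [(ts, ps) for tp_, pp, ts, ps in records if tp_ == pp]

    tp, fp, fn = Counter(), Counter(), Counter()
    for true_sub, pred_sub in kept:
        if true_sub == pred_sub:
            tp[true_sub] += 1
        else:
            fn[true_sub] += 1   # missed the true subtype
            fp[pred_sub] += 1   # wrongly predicted another subtype

    labels = sorted({true_sub for true_sub, _ in kept})
    if not labels:
        return 0.0

    f1_values = []
    for label in labels:
        denom = 2 * tp[label] + fp[label] + fn[label]
        f1_values.append(2 * tp[label] / denom if denom else 0.0)
    return sum(f1_values) / len(f1_values)  # unweighted (macro) mean


example = [
    ("mixer", "mixer", "spiral", "spiral"),
    ("mixer", "mixer", "spiral", "planetary"),
    ("mixer", "mixer", "planetary", "planetary"),
    ("mixer", "oven", "planetary", "deck"),  # wrong parent: excluded
]
score = type_aware_score(example)
print(round(score, 3))
```

In the example, both subtypes end up with F1 of 2/3 after the wrong-parent record is dropped, so the macro average is also 2/3. A model with strong mAP can still score poorly here if it finds the mixer but confuses spiral with planetary.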

After per-camera fine-tune — 48h and 7-day gain

Evolution's per-camera adapters produce measurable gains within 48 hours of deployment and continue improving through the first week. Numbers below are indicative across recent partner sites; individual results vary by scene complexity and lighting conditions.

checkpoint             mAP@0.5  mAP@0.5:0.95  recall  type-aware score
baseline (onboarding)  0.84     0.66          0.81    0.76
after 48h retrain      0.89     0.71          0.91    0.83
after 7-day retrain    0.94     0.78          0.96    0.90

Per-frame latency

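Measured on the deployment hardware we recommend (NVIDIA T4 class). Sub-50ms at 1080p batch-1 is the design target. Per-frame latency rises only slightly as batch size grows (44ms at batch 1 to 53ms at batch 8 at 1080p), so total batch latency scales approximately linearly with batch size at small batches.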

resolution  batch 1  batch 4        batch 8
720p        28ms     31ms / frame   34ms / frame
1080p       44ms     48ms / frame   53ms / frame
4K          112ms    119ms / frame  131ms / frame

// T4-class GPU · detection + tracking composite · with per-camera adapter loaded · 4K uses tile-and-merge
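
The latency figures can be turned into batch totals and a check against the design target. This sketch assumes the "/ frame" values are amortised per-frame costs, so that the time to finish a whole batch is the per-frame figure multiplied by the batch size. The function names are hypothetical; the numbers are copied from the table.

```python
# Per-frame latency in milliseconds, copied from the table above.
PER_FRAME_MS = {
    "720p": {1: 28, 4: 31, 8: 34},
    "1080p": {1: 44, 4: 48, 8: 53},
    "4K": {1: 112, 4: 119, 8: 131},
}


def batch_latency_ms(resolution: str, batch: int) -> int:
    """Total time to process one batch, assuming amortised per-frame figures."""
    return PER_FRAME_MS[resolution][batch] * batch


def amortised_fps(resolution: str, batch: int) -> float:
    """Frames per second implied by the per-frame figure."""
    return 1000.0 / PER_FRAME_MS[resolution][batch]


def meets_target(resolution: str, batch: int, target_ms: float = 50.0) -> bool:
    """Check the per-frame figure against the sub-50ms design target."""
    return PER_FRAME_MS[resolution][batch] < target_ms


for batch in (1, 4, 8):
    print(batch, batch_latency_ms("1080p", batch), meets_target("1080p", batch))
```

Under that assumption, a batch of 8 at 1080p takes about 424ms end to end, which matters for alerting delay even though the per-frame cost only rises from 44ms to 53ms.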

/ open-vocabulary detection

Prompted in plain English.
Type-aware on the first frame.

Each row shows the prompt THINK issued and the result SEE produced. Prompts are drawn from real partner deployments. Images are from the DarkfieldOps-300 evaluation set.

// images from DarkfieldOps-300 · partner footage used with permission · identifiers anonymised

/ architecture

Architecture overview.

SEE is a vision transformer with a detection and segmentation head, a text prompt encoder, and a jointly-trained tracker module. The backbone is trained on industrial CCTV corpora from the ground up — not fine-tuned from a general-purpose vision model. The segmentation head follows a SAM-class decoder architecture adapted for the aspect ratios and scene density of overhead and angled CCTV views.

Per-camera adapters are attached at the last two backbone layers. THINK launches the fine-tune autonomously and hot-swaps the new weights once they pass validation, without stopping inference.
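
One common way to swap weights without pausing inference is to keep the active adapter behind a single guarded reference. The sketch below illustrates that general pattern only; it is not a description of SEE's internals. `AdapterSlot` and `infer` are hypothetical names, and the adapters are stand-in callables rather than real model weights.

```python
# Generic hot-swap pattern, shown with stand-in callables instead of weights.
import threading
from typing import Any, Callable

AdapterFn = Callable[[Any], Any]


class AdapterSlot:
    """Holds the currently active adapter for one camera."""

    def __init__(self, adapter: AdapterFn) -> None:
        self._lock = threading.Lock()
        self._adapter = adapter

    def current(self) -> AdapterFn:
        with self._lock:
            return self._adapter

    def swap(self, new_adapter: AdapterFn) -> AdapterFn:
        """Install `new_adapter` and return the previous one for auditing."""
        with self._lock:
            old, self._adapter = self._adapter, new_adapter
        return old


def infer(slot: AdapterSlot, frame: Any) -> Any:
    # Take one snapshot per frame so a swap never changes weights mid-frame.
    adapter = slot.current()
    return adapter(frame)


slot = AdapterSlot(lambda frame: ("v1", frame))
before = infer(slot, "frame-0")
previous = slot.swap(lambda frame: ("v2", frame))
after = infer(slot, "frame-1")
print(before, after)
```

The inference loop never stops: frames that started before the swap finish on the old adapter, and the next frame picks up the new one.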

/ training data

What SEE was trained on.

Pre-training data is sourced under partner agreements. Synthetic supplementation is used for rare subtypes and extreme lighting conditions. No consumer-PII frames are used at any stage. Verticals covered with confidence in the current release:

  • Forecourt — vehicles, plates, pumps, canopy
  • Food manufacturing — mixers, fillers, ovens, conveyors
  • Cold store & cold chain — pallets, compressors, dock doors
  • Warehouse yard — vehicles, forklifts, staging zones
  • Quarry aggregates — heavy plant, conveyors, stockpiles

/ limitations

What SEE
won't do.

small & far objects

Detections of objects smaller than roughly 16×16 pixels at inference resolution are unreliable. THINK mitigates this by recommending closer camera placement at onboarding.
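
The size limit is stated at inference resolution, so a box that looks large enough in the source frame may fall below it after downscaling. A minimal sketch, assuming `(x1, y1, x2, y2)` pixel boxes, uniform scaling, and a rule that flags a box when either side is under 16 pixels; the helper names and that rule are assumptions, not SEE behaviour.

```python
# Hypothetical helpers: the names and the "either side" rule are assumptions.
from typing import Tuple

Box = Tuple[float, float, float, float]  # x1, y1, x2, y2 in pixels

MIN_SIDE_PX = 16  # "roughly 16x16 pixels" from the limitation above


def scale_box(box: Box, source_width: int, inference_width: int) -> Box:
    """Rescale a box from source resolution to inference resolution."""
    factor = inference_width / source_width
    x1, y1, x2, y2 = box
    return (x1 * factor, y1 * factor, x2 * factor, y2 * factor)


def is_unreliable(box: Box, min_side: float = MIN_SIDE_PX) -> bool:
    """True when either side of the box is shorter than `min_side` pixels."""
    x1, y1, x2, y2 = box
    return (x2 - x1) < min_side or (y2 - y1) < min_side


box_source = (100, 100, 140, 130)                    # 40x30 px in a 3840-wide frame
box_inference = scale_box(box_source, 3840, 1920)    # 20x15 px at 1920 wide
print(is_unreliable(box_source), is_unreliable(box_inference))
```

A check like this at onboarding would show which regions of a camera view are too distant to monitor reliably before any alerts depend on them.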

extreme weather

Heavy rain, dense fog, and direct sun flare degrade detection quality. SEE reports a confidence signal per-frame that THINK uses to decide whether to alert or hold.
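
The alert-or-hold decision can be pictured as a gate over recent per-frame confidence values. This is a sketch of one possible rule, not THINK's actual policy: the function name, the 0.5 threshold, and the requirement that every recent frame clears it are all assumptions chosen for illustration.

```python
# Hypothetical decision rule: threshold and windowing are assumptions.
from typing import Sequence


def alert_or_hold(recent_confidences: Sequence[float],
                  min_confidence: float = 0.5) -> str:
    """Return "alert" only if every recent frame is confident enough.

    An empty window yields "hold", since there is no evidence to act on.
    """
    if not recent_confidences:
        return "hold"
    if min(recent_confidences) >= min_confidence:
        return "alert"
    return "hold"


clear_day = alert_or_hold([0.9, 0.8, 0.85])
sun_flare = alert_or_hold([0.9, 0.3, 0.85])
print(clear_day, sun_flare)
```

Using the minimum rather than the mean makes a single degraded frame enough to hold, which errs towards fewer false alerts in poor weather.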

novel verticals before per-camera fine-tune

Zero-shot performance on verticals outside the training distribution is weaker. The 48-hour onboarding period with human verification is designed specifically to cover this gap.

face recognition

Not implemented, by policy. SEE assigns persistent IDs to people without identifying them biometrically. This is a hard boundary, not a capability gap.

audio

Out of scope. SEE is a vision model. THINK's voice calls are synthesised output, not acoustic input — microphone feeds are not processed.

/ specification

Model card.

model          Darkfield SEE
role           Perceiver · data plane · per-camera detection, tracking, segmentation, OCR, counting
system         System 1 — runs on every frame at line rate
modality       Vision only — RTSP frames at 720p / 1080p / 4K
prompt format  Text prompt from THINK at pipeline composition time; not per-frame
output format  Bounding boxes · masks · track IDs · OCR strings · count totals · zone events
latency        <50ms per frame at 1080p, batch 1, on T4-class hardware
eval set       DarkfieldOps-300 — public, research licence — research.html
params         undisclosed in private beta
availability   Private beta · partner access only · edge deployment supported

// citations and linked papers → research.html#papers

Run SEE against
your cameras.

We're onboarding a small number of partners in private beta.
