models

Two models.
One thinks, one looks.

Darkfield is built on a deliberate split: a multimodal planner that reasons across minutes, and a vision perceiver that responds in under fifty milliseconds. Neither model is general-purpose. Both were designed for the same problem.

/ architecture

Planner and perceiver,
designed together.

system 2 · planner · multimodal
Darkfield THINK

Reads documentation, inspects streams, authors pipelines, supervises SEE. Runs when reasoning is needed — not on every frame.

  • Conversational planning — ingests PDFs, SOPs, KPIs to negotiate a schema before detection begins.
  • Live video understanding — inspects feeds and recommends repositioning when a camera is insufficient.
  • Pipeline authorship — composes detect / track / OCR / count / zone steps into a runnable graph.
  • Spatial grounding — resolves natural language to pixel coordinates on the image.
  • Finetune controller — evaluates SEE's output, curates training pairs, decides retrain vs. model-swap.
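As a rough illustration of what a "runnable graph" of detect / track / OCR / count / zone steps could look like, here is a minimal sketch. The `Pipeline` class, the dependency edges, and the lowercase step names are assumptions made for this example; the page does not describe Darkfield's actual pipeline format.

```python
from dataclasses import dataclass, field
from graphlib import TopologicalSorter


@dataclass
class Pipeline:
    """Hypothetical pipeline graph: maps each step to the steps it depends on."""

    deps: dict[str, set[str]] = field(default_factory=dict)

    def add(self, step: str, after: tuple[str, ...] = ()) -> "Pipeline":
        # Register a step and the steps that must finish before it runs.
        self.deps[step] = set(after)
        return self

    def order(self) -> list[str]:
        # A topological sort yields an order that respects every dependency
        # and raises graphlib.CycleError if the graph contains a cycle.
        return list(TopologicalSorter(self.deps).static_order())


# Assumed wiring: tracking and OCR consume detections, zones consume tracks,
# and counting consumes zone assignments.
pipeline = (
    Pipeline()
    .add("detect")
    .add("track", after=("detect",))
    .add("ocr", after=("detect",))
    .add("zone", after=("track",))
    .add("count", after=("zone",))
)
execution_order = pipeline.order()
```

Representing the pipeline as a dependency graph rather than a fixed list means a planner could add or drop a step (for example OCR) without rewriting the rest of the chain.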
orchestrates
system 1 · perceiver · vision
Darkfield SEE

Open-vocabulary, type-aware industrial perception. Sub-fifty-millisecond per-frame latency. Continuously finetuned, per-camera, by THINK.

  • Open-vocabulary detection — describe what to find in plain English; no class list required.
  • Type-aware recognition — distinguishes a spiral mixer from a planetary mixer on the first frame.
  • Built-in tracker — re-ID through occlusion and frame re-entry, no third-party dependency.
  • Segmentation and counting — pixel-accurate masks; stable totals under crowding and motion blur.
  • Continuously finetuned — per-camera adapters trained on model-curated samples, hot-swapped at validation.
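One way to read "hot-swapped at validation" is that a newly trained per-camera adapter replaces the active one only after it beats it on a held-out validation set. The sketch below works under that assumption; `Adapter`, `AdapterRegistry`, `val_score`, and `min_gain` are hypothetical names, not part of any published Darkfield interface.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Adapter:
    version: str
    val_score: float  # assumed: score on a held-out validation set for this camera


class AdapterRegistry:
    """Hypothetical registry holding one active adapter per camera."""

    def __init__(self, min_gain: float = 0.0) -> None:
        self.min_gain = min_gain  # required improvement before a swap happens
        self._active: dict[str, Adapter] = {}

    def active(self, camera_id: str) -> Optional[Adapter]:
        return self._active.get(camera_id)

    def propose(self, camera_id: str, candidate: Adapter) -> bool:
        """Swap in the candidate only if it clears the validation gate."""
        current = self._active.get(camera_id)
        if current is not None and candidate.val_score <= current.val_score + self.min_gain:
            return False  # keep serving the current adapter
        self._active[camera_id] = candidate
        return True


registry = AdapterRegistry(min_gain=0.01)
first = registry.propose("cam-1", Adapter("v1", 0.80))     # no adapter yet, accepted
marginal = registry.propose("cam-1", Adapter("v2", 0.805))  # gain below threshold, rejected
better = registry.propose("cam-1", Adapter("v3", 0.85))     # clear gain, swapped in
```

Keeping the gate per camera means a regression on one feed never affects the adapter serving another.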

// THINK is too expensive to run per-frame. SEE is too narrow to plan. They were designed together precisely because the split is the feature.

/ design rationale

Why two models,
not one?

The trade-off between reasoning capacity and per-frame latency makes a single-model design strictly worse than the split. A multimodal LLM can't return a result within fifty milliseconds per frame; a frame-rate vision model can't read an SOP or author a schema. Running a large model on every frame would cost three orders of magnitude more and still produce worse plans.
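The latency argument can be made concrete with a small back-of-the-envelope calculation. Only the fifty-millisecond bound comes from the text; the five-second planner latency and the 20 fps stream rate are illustrative assumptions, not measured figures.

```python
def max_sustained_fps(latency_s: float) -> float:
    """Frames per second one model instance can keep up with, processing serially."""
    return 1.0 / latency_s


def frames_arriving_during(latency_s: float, stream_fps: float) -> float:
    """How many new frames arrive while a single call is still running."""
    return latency_s * stream_fps


SEE_LATENCY_S = 0.050     # upper bound stated in the text (<50 ms per frame)
PLANNER_LATENCY_S = 5.0   # assumption: a planner call at the low end of "seconds-minutes"
STREAM_FPS = 20.0         # assumption: camera frame rate

see_fps = max_sustained_fps(SEE_LATENCY_S)
planner_backlog = frames_arriving_during(PLANNER_LATENCY_S, STREAM_FPS)
```

Under these assumptions the perceiver keeps pace with roughly 20 frames per second, while about 100 frames would arrive during a single planner call, which is why the planner is invoked on events rather than on frames.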

The split also gives you a natural separation of trust: SEE handles the data plane (raw video, bounding boxes, labels) while THINK handles the control plane (planning, evaluation, retraining decisions, outbound alerts). That boundary makes both models auditable, independently replaceable, and easier to reason about in a compliance context.

THINK
role: planner · orchestrator
latency target: seconds–minutes
input modality: text · image · video · docs
fires when: on planning events

SEE
role: perceiver · data plane
latency target: <50 ms per frame
input modality: vision only
fires when: on every frame
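The data-plane / control-plane split can be sketched as a loop in which the perceiver runs on every frame and the planner is woken only by planning events. The classes, the event names, and the `run` function below are hypothetical stand-ins chosen for illustration, not Darkfield's actual interfaces.

```python
from dataclasses import dataclass


@dataclass
class Detection:
    label: str
    box: tuple[int, int, int, int]  # x, y, width, height in pixels


class Perceiver:
    """Stand-in for SEE (data plane): called once per frame."""

    def __init__(self) -> None:
        self.frames_seen = 0

    def process(self, frame: bytes) -> list[Detection]:
        self.frames_seen += 1
        return []  # placeholder: a real perceiver would return detections


class Planner:
    """Stand-in for THINK (control plane): called only on planning events."""

    def __init__(self) -> None:
        self.events_handled: list[str] = []

    def on_event(self, event: str) -> None:
        self.events_handled.append(event)


def run(frames: list[bytes], events: dict[int, str],
        see: Perceiver, think: Planner) -> None:
    """Process every frame with the perceiver; wake the planner only when an event is due."""
    for index, frame in enumerate(frames):
        see.process(frame)
        if index in events:
            think.on_event(events[index])


see, think = Perceiver(), Planner()
# Assumed event names, keyed by the frame index at which they occur.
run([b""] * 100, {0: "schema_negotiated", 60: "drift_suspected"}, see, think)
```

Because the two roles only meet inside `run`, either stand-in could be replaced or audited without touching the other, which mirrors the separation of trust described above.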

Ready to see the models
against your streams?

We're onboarding a small number of partners in private beta.

read the technical brief