Aiming language at pixels.
Grounding resolves a noun phrase like "the third valve from the left" into a pixel cluster, with industrial-grade precision. THINK's grounding head is distilled from an internal annotation corpus we'd never use to train SEE directly.
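
To make the valve example concrete, here is a minimal sketch of ordinal grounding. It is not THINK's grounding head. It assumes an upstream detector has already produced labeled bounding boxes, and every name in it (`Detection`, `resolve_ordinal`, `box_pixels`) is hypothetical, chosen only for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional, Set, Tuple

Box = Tuple[int, int, int, int]  # (x_min, y_min, x_max, y_max), max edges exclusive


@dataclass(frozen=True)
class Detection:
    """Hypothetical detector output: a class label plus a pixel bounding box."""
    label: str
    box: Box


# Small illustrative vocabulary; a real parser would cover far more phrasing.
ORDINALS = {"first": 1, "second": 2, "third": 3, "fourth": 4, "fifth": 5}


def resolve_ordinal(
    detections: List[Detection],
    label: str,
    ordinal: str,
    from_side: str = "left",
) -> Optional[Detection]:
    """Pick the `ordinal`-th detection of `label`, counting from `from_side`."""
    matches = [d for d in detections if d.label == label]
    # Order candidates by horizontal center; reverse when counting from the right.
    matches.sort(
        key=lambda d: (d.box[0] + d.box[2]) / 2,
        reverse=(from_side == "right"),
    )
    index = ORDINALS[ordinal] - 1
    # Return None rather than guessing when the scene has too few candidates.
    return matches[index] if index < len(matches) else None


def box_pixels(box: Box) -> Set[Tuple[int, int]]:
    """Expand a box into the set of (x, y) pixel coordinates it covers."""
    x0, y0, x1, y1 = box
    return {(x, y) for y in range(y0, y1) for x in range(x0, x1)}
```

Calling `resolve_ordinal(detections, "valve", "third")` returns the matching detection, and `box_pixels` turns its box into a coarse pixel cluster. A box is only a rough stand-in for a true segmentation mask, which is where a learned grounding head would matter most.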