DISSECT: Diagnosing Where Vision Ends and Language Priors Begin in Scientific VLMs
Researchers introduce DISSECT, a 12,000-question diagnostic benchmark that reveals a critical "perception-integration gap" in Vision-Language Models: VLMs often extract visual information successfully yet fail to use it in downstream reasoning. Across 18 VLMs tested on Chemistry and Biology, open-source models systematically struggle to integrate visual input into their reasoning, while closed-source models show little or no such gap.
The DISSECT benchmark addresses a fundamental blind spot in how Vision-Language Models are evaluated. Traditional accuracy metrics conflate perception with reasoning, masking failures that occur after visual information has been successfully extracted. The distinction matters: a model that can describe a benzene ring but miscalculates its properties looks, under aggregate scores, as if it has comprehensive multimodal understanding.
The research reveals a significant gap between open-source and closed-source VLM architectures. Open-source models show measurable performance improvements when reasoning from their own verbalized descriptions of images rather than from the raw visual input, suggesting integration bottlenecks in their pipelines. Closed-source models show no such gap, suggesting their architectures handle this integration step more effectively. This divergence represents a concrete competitive advantage for proprietary systems and points to where open-source development needs focused investment.
The chemistry-biology split is particularly revealing. Chemistry exhibits lower language-prior exploitability: models cannot answer its questions from linguistic patterns alone and genuinely need visual reasoning. This makes chemistry tasks more diagnostic of genuine multimodal capability than biology tasks, where language priors alone provide a stronger performance baseline.
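The notion of language-prior exploitability can be made concrete with a text-only baseline: strip the image, ask the question anyway, and measure how far above chance the model lands. A minimal sketch of such a metric follows; the normalization formula and the example numbers are illustrative assumptions, not DISSECT's published metric or results.

```python
def prior_exploitability(text_only_acc: float, chance_acc: float) -> float:
    """Fraction of above-chance headroom recoverable with no image at all.

    0.0 -> text-only performance is at chance (questions genuinely need vision);
    1.0 -> language priors alone solve the task.
    """
    if not 0.0 <= chance_acc < 1.0:
        raise ValueError("chance accuracy must be in [0, 1)")
    return max(0.0, (text_only_acc - chance_acc) / (1.0 - chance_acc))

# Hypothetical numbers in the spirit of the chemistry/biology split,
# assuming 4-way multiple choice (chance = 0.25):
chem = prior_exploitability(0.30, 0.25)  # low: vision genuinely needed
bio = prior_exploitability(0.55, 0.25)   # higher: priors carry more weight
```

A low score marks a subject as more diagnostic: correct answers there are harder to attribute to linguistic pattern-matching.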
The Model Oracle protocol enables post-hoc diagnosis of any VLM, making this framework immediately applicable across the industry. As VLM deployment accelerates in scientific research, education, and professional applications, identifying and fixing integration failures becomes critical. Organizations relying on these models for molecular design, biological analysis, or other visual-reasoning tasks need diagnostic tools to assess whether performance improvements stem from genuine capability gains or artifact exploitation.
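A post-hoc diagnostic in the spirit of the Model Oracle protocol can be sketched as two evaluation passes over the same questions: answer directly from the image, then answer from the model's own verbalized description of the image. If the self-description pass wins, perception succeeded and integration is the bottleneck. All interface names below are hypothetical stand-ins for a VLM's endpoints, not DISSECT's actual API.

```python
from typing import Callable, Dict, Iterable, Tuple

Item = Tuple[str, str, str]  # (image, question, gold answer)

def diagnose_integration(
    items: Iterable[Item],
    answer_direct: Callable[[str, str], str],     # model(image, question)
    describe: Callable[[str], str],               # model's own image description
    answer_from_text: Callable[[str, str], str],  # model(description, question)
) -> Dict[str, float]:
    """Compare direct multimodal answering with reasoning over the model's
    own verbalized description. A positive integration_gap (text-mediated
    beats direct) signals that perception succeeded but integration failed."""
    n = direct_hits = mediated_hits = 0
    for image, question, gold in items:
        n += 1
        direct_hits += answer_direct(image, question) == gold
        mediated_hits += answer_from_text(describe(image), question) == gold
    return {
        "direct_acc": direct_hits / n,
        "mediated_acc": mediated_hits / n,
        "integration_gap": (mediated_hits - direct_hits) / n,
    }
```

Because the protocol only needs black-box question answering and captioning, it applies post hoc to any deployed VLM, which is what makes this style of diagnosis broadly reusable.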
- Open-source VLMs show a systematic integration failure: they reason better from their own verbalized descriptions of an image than from the raw image, while closed-source models show no such gap.
- Chemistry tasks probe genuine visual reasoning better than biology tasks because language priors provide less help in chemistry.
- The perception-integration gap is invisible to standard benchmarks, so aggregate accuracy alone is insufficient for evaluating multimodal reasoning quality.
- DISSECT's Model Oracle protocol provides a replicable diagnostic method, applicable to any VLM, that decomposes performance into distinct capability components.
- Integration capability appears to be the primary frontier separating closed-source from open-source VLM performance in scientific applications.