Researchers introduced SpatialBench-Long, a comprehensive benchmark testing AI agents' ability to conduct end-to-end scientific reasoning on complex spatial biology data without prescribed methods. The benchmark spans 24 evaluations across multiple cancer and aging systems using diverse measurement technologies. Current leading models achieve a success rate of only 11.1%, revealing significant limitations in AI's capacity for autonomous biological discovery.
SpatialBench-Long addresses a critical gap in AI evaluation frameworks by moving beyond narrow procedural competency testing toward genuine scientific reasoning. Unlike existing benchmarks that assess isolated analysis steps or broad biological knowledge, this framework requires agents to recover meaningful biological claims from raw experimental data—a task demanding integration across multiple data modalities, domain expertise, and inferential reasoning. The benchmark encompasses pancreatic cancer, glioblastoma organoids, lung adenocarcinoma lineage tracing, and aging studies, employing eight different spatial measurement technologies including CosMx, Visium, and MERFISH alongside traditional sequencing approaches.
The results are sobering: three model-harness combinations (Gemini 3.5 Flash, GPT-5.5 with different coding interfaces) succeeded in only 8 of 72 runs (11.1%), indicating a substantial gap between current AI capabilities and what autonomous scientific discovery requires. This finding reflects the complexity of spatial biology, where understanding cellular relationships, anatomical context, and temporal dynamics requires reasoning far beyond pattern recognition.
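The headline figure can be checked with a few lines of arithmetic. The sketch below assumes that each of the three model-harness combinations was run once on each of the 24 evaluations, which is consistent with the 72 runs reported but is not spelled out in the summary; the variable names are illustrative only.

```python
from fractions import Fraction

# Figures taken from the summary above.
EVALUATIONS = 24      # evaluations in the benchmark
COMBINATIONS = 3      # model-harness combinations tested
SUCCESSFUL_RUNS = 8   # runs graded as successful

# Assumption: one run per combination per evaluation.
total_runs = EVALUATIONS * COMBINATIONS

# Exact success rate as a fraction, then as a rounded percentage.
success_rate = Fraction(SUCCESSFUL_RUNS, total_runs)
success_pct = round(float(success_rate) * 100, 1)

print(f"{SUCCESSFUL_RUNS}/{total_runs} = {success_pct}%")
```

Under that assumption the rate reduces to exactly one successful run in nine.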
For the AI research community, these results establish a baseline against which claims of genuine scientific AI can be measured. The deterministic grading system and controlled vocabularies enable reproducible evaluation that does not depend on subjective assessment. For the biotech and pharmaceutical sectors, the low current performance suggests AI agents cannot yet replace expert scientists in complex spatial analysis, though the benchmark itself provides a roadmap for identifying and closing capability gaps. As models improve, this framework enables rigorous validation of progress toward autonomous biological discovery, in place of capability claims based on superficial benchmarks.
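To illustrate why a controlled vocabulary makes grading deterministic, here is a minimal sketch. It is not the benchmark's actual grader: the vocabulary terms, the `normalize` helper, and the `grade` function are all hypothetical, chosen only to show the general idea that an answer restricted to a fixed set of terms can be compared by exact match, with no human or model judgment involved.

```python
# Hypothetical controlled vocabulary; the real benchmark's terms are not
# given in the summary.
ALLOWED_ANSWERS = frozenset({"increased", "decreased", "unchanged"})


def normalize(answer: str) -> str:
    """Lowercase the answer and collapse surrounding/internal whitespace."""
    return " ".join(answer.strip().lower().split())


def grade(answer: str, expected: str, vocabulary=ALLOWED_ANSWERS) -> bool:
    """Return True only if the answer is an allowed term that matches expected.

    The same inputs always produce the same verdict, which is what makes
    the grading reproducible.
    """
    expected_norm = normalize(expected)
    if expected_norm not in vocabulary:
        raise ValueError(f"expected answer {expected!r} is not in the vocabulary")
    answer_norm = normalize(answer)
    # Free-text answers outside the vocabulary are rejected, not interpreted.
    if answer_norm not in vocabulary:
        return False
    return answer_norm == expected_norm


print(grade("  Increased ", "increased"))
print(grade("went up", "increased"))
```

The design choice in this sketch is that a paraphrase such as "went up" fails even though a human reader might accept it; that strictness is the price of removing subjective judgment from the score.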
- →Current leading AI models achieve only 11.1% success on complex spatial biology reasoning tasks, revealing significant capability gaps
- →SpatialBench-Long tests end-to-end scientific reasoning with raw data rather than isolated procedural steps, providing more realistic AI assessment
- →The benchmark covers diverse cancer and aging systems across eight different spatial measurement technologies, ensuring broad applicability
- →Deterministic grading with controlled vocabularies enables reproducible evaluation of AI scientific reasoning abilities
- →Results suggest AI agents cannot yet autonomously conduct expert-level spatial biology analysis without human guidance