AI · Neutral · arXiv – CS AI · 3h ago · 6/10
Verifiable Benchmarking of Long-Horizon Spatial Biology
Researchers introduced SpatialBench-Long, a benchmark that tests whether AI agents can carry out end-to-end scientific reasoning on complex spatial biology data without prescribed methods. The benchmark comprises 24 evaluations across multiple cancer and aging systems and uses diverse measurement technologies. Current leading models achieve a success rate of only 11.1%, which points to significant limits on AI's capacity for autonomous biological discovery.
🏢 OpenAI · 🧠 GPT-5 · 🧠 Gemini