AINeutralarXiv – CS AI · 7h ago6/10
🧠
Auto-Discovery-Bench: Diagnosing Structured State Tracking in Oracle-Guided Discovery
Researchers introduce Auto-Discovery-Bench, a diagnostic benchmark that tests AI agents' ability to maintain and update structured beliefs through iterative hypothesis-intervention-feedback cycles. The benchmark reveals that performance degrades significantly with increased complexity variables, and identifies limitations in long-range structured information integration as a key bottleneck for scientific discovery agents.