arXiv – CS AI
ForeSci: Evaluating LLM Agents for Forward-Looking AI Research Judgment
ForeSci is a new benchmark that evaluates whether large language model agents can make forward-looking research decisions using only historical evidence. It spans 500 tasks across AI domains. The research finds that explicit evidence organization improves traceability, but an evidence-decision decoupling problem persists: agents cite relevant sources yet still reach incorrect conclusions.