AI · Neutral · arXiv – CS AI · 7h ago · 7/10
WorldReasoner: Evaluating Whether Language Model Agents Forecast Events with Valid Reasoning
Researchers introduce WorldReasoner, an evaluation framework that assesses whether language model agents forecast real-world events through valid reasoning rather than memorization or fabrication. The framework scores forecasts on three dimensions (outcome accuracy, evidence quality, and causal reasoning) using 345 resolved tasks built from over 14,000 articles. The evaluation reveals that agents struggle to convert grounded evidence into properly calibrated probabilities, despite improvements in temporally valid retrieval.