WorldReasoner: Evaluating Whether Language Model Agents Forecast Events with Valid Reasoning
Researchers introduce WorldReasoner, an evaluation framework that assesses whether language model agents forecast real-world events through valid reasoning rather than memorization or fabrication. The framework scores forecasts along three dimensions (outcome accuracy, evidence quality, and causal reasoning) using 345 resolved tasks built from over 14,000 articles. It finds that temporally valid retrieval improves accuracy, yet agents still struggle to convert grounded evidence into well-calibrated probabilities.
WorldReasoner addresses a critical gap in AI evaluation: distinguishing agents that reach correct answers through genuine reasoning from those that rely on memorized facts or confabulated evidence. Traditional accuracy metrics mask this difference, which makes it hard to trust AI systems for real-world forecasting tasks where reasoning transparency matters as much as outcomes. By scoring outcome accuracy, evidence quality, and causal reasoning separately, the framework gives a fuller picture of agent capabilities than a single final-answer score.
The research emerges from broader efforts to validate AI reasoning across complex domains. Prior work focused primarily on final-answer accuracy, overlooking how agents arrive at conclusions. WorldReasoner's agentic construction pipeline demonstrates a scalable methodology: it generated 345 forecasting tasks from 14,141 timestamped articles and constructed 8,087 event-based causal graphs. This dataset-building approach is itself a technical contribution, establishing a reproducible benchmark for temporal reasoning.
The findings reveal significant limitations in current language model agents. Temporally valid retrieval, meaning access only to evidence available before the forecast date, emerged as the strongest driver of accuracy, which suggests that agents may often violate temporal constraints when those constraints are not enforced. Causal graph construction improved key-event recovery, but agents remained poorly calibrated even when their evidence was grounded. This gap indicates that agents can identify relevant facts but struggle to translate them into probabilities that reflect their actual uncertainty.
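To make the idea of temporally valid retrieval concrete, the following is a minimal sketch of a date filter that discards any evidence published on or after the forecast date. It is an illustration under stated assumptions, not the paper's retrieval pipeline: the `Article` class, the `temporally_valid` function, and the sample articles are all hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class Article:
    # Hypothetical evidence record; field names are assumptions.
    title: str
    published_at: datetime


def temporally_valid(articles, forecast_date):
    """Keep only articles published strictly before the forecast date."""
    return [a for a in articles if a.published_at < forecast_date]


# Made-up sample data for illustration only.
articles = [
    Article("Early report", datetime(2024, 1, 10)),
    Article("Follow-up analysis", datetime(2024, 2, 1)),
    Article("Post-resolution recap", datetime(2024, 3, 5)),
]
forecast_date = datetime(2024, 2, 15)

valid = temporally_valid(articles, forecast_date)
print([a.title for a in valid])  # the post-resolution recap is excluded
```

Applying the filter before the agent sees any evidence, rather than trusting the agent to ignore later articles, keeps post-resolution information out of the forecast by construction.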
These limitations have downstream implications for AI applications in forecasting, intelligence analysis, and decision support systems. Organizations deploying language models for event prediction must implement additional validation layers beyond output accuracy. Future research should focus on bridging the evidence-to-calibration gap, potentially through uncertainty quantification methods or ensemble approaches that explicitly model confidence levels based on evidence quality and temporal distance.
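One simple validation layer beyond accuracy is a calibration-sensitive score. The sketch below uses the Brier score (mean squared error between forecast probabilities and binary outcomes) as an example; whether WorldReasoner uses this particular metric is an assumption here, and the function name and sample numbers are hypothetical.

```python
def brier_score(probabilities, outcomes):
    """Mean squared error between forecast probabilities and 0/1 outcomes.

    Lower is better; a constant forecast of 0.5 always scores 0.25.
    """
    if len(probabilities) != len(outcomes) or not probabilities:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum((p - o) ** 2 for p, o in zip(probabilities, outcomes)) / len(probabilities)


# Made-up outcomes: two events happened, two did not.
outcomes = [1, 1, 0, 0]

# An overconfident agent says 0.9 for everything; a hedged agent says 0.5.
overconfident = brier_score([0.9, 0.9, 0.9, 0.9], outcomes)
hedged = brier_score([0.5, 0.5, 0.5, 0.5], outcomes)

print(round(overconfident, 2))  # 0.41
print(round(hedged, 2))         # 0.25
```

Both agents are right on exactly half of the questions if forecasts are thresholded at 0.5 or above, yet the overconfident one scores worse. This is the kind of difference that final-answer accuracy alone cannot show.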
- The WorldReasoner framework evaluates AI forecasting along three dimensions (outcome accuracy, evidence quality, and causal reasoning) rather than relying on final-answer scores alone.
- Temporally valid retrieval is the strongest driver of forecast accuracy, which suggests agents may often violate temporal constraints when those constraints are not enforced.
- Despite having access to grounded evidence, agents struggle to convert it into properly calibrated probability estimates.
- The agentic construction pipeline generates 345 resolved forecasting tasks from 14,141 articles, establishing a scalable benchmarking methodology for evaluating temporal reasoning.
- Causal graph construction improves key-event recovery, but additional mechanisms are needed to translate evidence into accurate uncertainty estimates.