EconCausal: A Context-Aware Economic Reasoning Benchmark for Large Language Models
Researchers introduced EconCausal, a benchmark dataset of 10,490 annotated economic causal relationships drawn from peer-reviewed studies. It shows that large language models struggle to condition their predictions on changing contexts: they achieve 88% accuracy in fixed scenarios but drop to 41.3% when a context shift requires reversing the causal direction.
EconCausal addresses a critical gap in LLM evaluation: testing whether models can reason about context-dependent causal relationships in economics and finance. Traditional benchmarks often test LLMs on isolated facts or fixed scenarios, but real-world decision-making requires understanding how institutional, regulatory, and market contexts alter causal relationships. The same policy intervention can produce opposite effects depending on timing, jurisdiction, or market conditions, a nuance that even top-performing models consistently fail to capture.
The benchmark was built through multi-stage validation across 2,595 peer-reviewed sources, which supports its use as a high-quality standard for evaluating economic reasoning. The performance gaps are striking: models reach 88% accuracy in explicit, unchanging contexts, but the reported 32.6 percentage-point drop when contexts shift reveals a fundamental limitation in reasoning flexibility. Models also handle null effects poorly, correctly identifying absent causal relationships less than 14% of the time, which suggests overconfidence in directional predictions.
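The accuracy gap and null-effect rate described above can be computed from labeled predictions with a few lines of code. The following is a minimal sketch, not the benchmark's actual evaluation harness: the record layout, the `accuracy_by` helper, and every row of toy data are assumptions made purely for illustration.

```python
from collections import defaultdict

# Hypothetical toy records: (condition, gold_sign, predicted_sign).
# Signs are "+", "-", or "0" (null effect). None of these rows come
# from EconCausal; they only illustrate how the metrics are computed.
RECORDS = [
    ("fixed", "+", "+"),
    ("fixed", "-", "-"),
    ("fixed", "+", "+"),
    ("fixed", "0", "+"),    # null effect predicted as a positive effect
    ("shifted", "-", "+"),  # model failed to reverse the sign
    ("shifted", "+", "+"),
    ("shifted", "0", "-"),  # null effect predicted as a negative effect
    ("shifted", "-", "-"),
]


def accuracy_by(records, key_index):
    """Return {group: fraction correct}, grouping by the field at key_index."""
    hits, totals = defaultdict(int), defaultdict(int)
    for rec in records:
        group = rec[key_index]
        totals[group] += 1
        hits[group] += int(rec[1] == rec[2])  # gold sign vs. predicted sign
    return {g: hits[g] / totals[g] for g in totals}


by_condition = accuracy_by(RECORDS, 0)  # fixed vs. shifted context
by_gold_sign = accuracy_by(RECORDS, 1)  # per-class accuracy, including null

# Gap between conditions, expressed in percentage points.
gap_pp = 100 * (by_condition["fixed"] - by_condition["shifted"])

print(by_condition, by_gold_sign, gap_pp)
```

Grouping by gold sign rather than reporting one overall accuracy keeps the null-effect class visible. A model that never predicts "0" can still score well overall if null effects are rare.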
For AI practitioners building decision-support systems in finance and policy, these findings underscore the risks of deploying LLMs without explicit context-awareness mechanisms. The models' tendency to over-commit to a directional sign, even when evidence is contradictory or context-dependent, is dangerous in domains where reversals matter. Investment firms, regulatory bodies, and economic advisory services that rely on LLM recommendations face material risks if models cannot reliably update predictions as market conditions or regulatory regimes change. The publicly released dataset should accelerate development of more robust context-aware reasoning architectures, which matters all the more as LLMs increasingly influence financial and policy decisions.
- LLMs achieve 88% accuracy on fixed economic causal relationships but drop to 41.3% when context changes require reversing directional predictions
- Models correctly identify null effects less than 14% of the time, indicating systematic over-commitment to directional predictions
- The benchmark comprises 10,490 annotated causal relationships from 2,595 top-tier economics and finance journal articles, providing high-quality training and evaluation data
- Performance degradation worsens dramatically when misleading evidence is introduced, suggesting poor robustness to contradictory contextual signals
- Context-aware reasoning gaps pose significant risks for LLM deployment in financial advisory and policy decision-support applications