AI · Neutral · arXiv – CS AI · 18h ago · 6/10
TempoBench: Evaluating Temporal Causal Reasoning in Large Language Models
Researchers introduce TempoBench, a formally verified benchmark for evaluating temporal causal reasoning in large language models. It reveals a wide gap between forward simulation performance (96% accuracy) and causal reasoning ability (below 25%). The study finds that LLMs struggle to identify minimal causal inputs: they over-specify, listing all possible inputs instead of reasoning about which ones are necessary.