🧠 AI · 🔴 Bearish · Importance: 7/10

Thinking Fast, Thinking Wrong: Intuitiveness Modulates LLM Counterfactual Reasoning in Policy Evaluation

arXiv – CS AI | Yanjie He
🤖 AI Summary

A new study reveals that large language models fail at counterfactual reasoning when policy findings contradict intuitive expectations, despite performing well on obvious cases. The research demonstrates that chain-of-thought prompting paradoxically worsens performance on counter-intuitive scenarios, suggesting current LLMs engage in 'slow talking' rather than genuine deliberative reasoning.

Analysis

This research exposes a fundamental limitation in how LLMs approach complex reasoning tasks, with implications extending beyond academia into policy analysis and decision-support systems. The study evaluated four frontier models across 2,400 trials using a carefully constructed benchmark grounded in peer-reviewed economics and social science research. The findings reveal a critical capability gap: models excel when policy outcomes align with common sense but systematically fail when the evidence contradicts intuition.
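To make the evaluation setup concrete, here is a minimal sketch of how such a benchmark trial might be scored, splitting accuracy by whether the ground-truth finding is intuitive. The Trial fields and the ask_model stub are assumptions for illustration, not the paper's actual harness:

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Trial:
    question: str       # policy question posed to the model
    ground_truth: str   # finding from the underlying peer-reviewed study
    intuitive: bool     # whether the finding matches common-sense expectations

def accuracy_by_intuitiveness(trials: list[Trial],
                              ask_model: Callable[[str], str]) -> dict[str, float]:
    """Score model answers, split by whether the true finding is intuitive."""
    buckets: dict[str, list[bool]] = {"intuitive": [], "counter_intuitive": []}
    for t in trials:
        correct = ask_model(t.question).strip().lower() == t.ground_truth.lower()
        buckets["intuitive" if t.intuitive else "counter_intuitive"].append(correct)
    return {k: sum(v) / len(v) for k, v in buckets.items() if v}
```

The key design point is that every item carries an intuitiveness label, which is what lets the study separate "can the model reason?" from "does the answer happen to match common sense?".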

The chain-of-thought paradox represents the most striking discovery. While structured prompting dramatically improves accuracy on obvious cases, this benefit collapses for counter-intuitive findings (interaction ratio of 0.053, p<0.001). This suggests that explicit reasoning steps may actually entrench initial biases rather than overcome them. The dominance of intuitiveness as an explanatory factor—accounting for more variance than model architecture or prompting strategy—indicates that LLM reasoning is fundamentally constrained by underlying knowledge patterns rather than reasoning capability.
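The summary does not quote the paper's exact definition of the interaction ratio, but one plausible reading (an assumption here) is the CoT accuracy gain on counter-intuitive items divided by the gain on intuitive items. A minimal sketch with made-up numbers shows how a ratio near 0.053 would arise:

```python
# Assumed definition: ratio of CoT's accuracy gain on counter-intuitive
# items to its gain on intuitive items. All numbers below are illustrative.
def interaction_ratio(acc: dict[str, float]) -> float:
    gain_intuitive = acc["cot_intuitive"] - acc["direct_intuitive"]
    gain_counter = acc["cot_counter"] - acc["direct_counter"]
    return gain_counter / gain_intuitive

# CoT helps a lot on intuitive items but barely moves counter-intuitive
# accuracy, so the ratio lands near 0.05.
acc = {"direct_intuitive": 0.70, "cot_intuitive": 0.89,
       "direct_counter": 0.40, "cot_counter": 0.41}
print(f"interaction ratio ≈ {interaction_ratio(acc):.3f}")  # ≈ 0.053
```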

The knowledge-reasoning dissociation is particularly revealing: citation familiarity shows no correlation with accuracy, indicating that the models possess the relevant information but cannot deploy it effectively when findings violate expectations. This challenges the assumption that scaling LLMs will naturally improve their policy analysis capabilities.

For practitioners relying on LLMs for policy evaluation, this research suggests significant limitations in current systems' ability to handle counter-intuitive but empirically validated findings. Organizations should implement human oversight mechanisms when LLMs evaluate policies with non-obvious implications, and future development should focus on mechanisms that decouple reasoning from intuitive priors rather than simply adding more parameters or prompting sophistication.
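As a concrete illustration of the dissociation check described above, here is a minimal sketch with assumed data: per-item familiarity scores and correctness labels (both hypothetical), with scipy's pearsonr standing in for whatever test the paper actually used:

```python
from scipy.stats import pearsonr
import numpy as np

rng = np.random.default_rng(0)
familiarity = rng.uniform(0, 1, 100)               # hypothetical familiarity scores
accuracy = rng.integers(0, 2, 100).astype(float)   # hypothetical per-item correctness

# Independent by construction here, so r should be near zero, which is the
# pattern the paper reports: knowing the literature does not predict accuracy.
r, p = pearsonr(familiarity, accuracy)
print(f"r = {r:.3f}, p = {p:.3f}")
```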

Key Takeaways
  • Chain-of-thought prompting backfires on counter-intuitive policy findings, suggesting LLMs entrench biases when explicitly reasoning through non-obvious cases.
  • Intuitiveness explains more variance in model performance than the choice of model or prompting strategy, revealing fundamental constraints in LLM reasoning.
  • Models possess relevant knowledge about policies but systematically fail to reason with it when findings contradict common expectations.
  • Current LLM 'slow thinking' may merely be 'slow talking'—producing reasoning outputs without genuine deliberative substance.
  • Human oversight remains critical for policy evaluation using LLMs, particularly for scenarios with counter-intuitive but empirically validated conclusions (a minimal routing sketch follows this list).
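A minimal sketch of the kind of routing such oversight implies, assuming an upstream flag for counter-intuitive cases (all names here are hypothetical, not from the paper):

```python
from dataclasses import dataclass

@dataclass
class PolicyVerdict:
    claim: str
    llm_answer: str
    counter_intuitive: bool  # flagged upstream, e.g. by comparing the claim
                             # against the model's own stated prior

def route(v: PolicyVerdict) -> str:
    """Send the regime where CoT accuracy collapses to a human reviewer."""
    return "human_review" if v.counter_intuitive else "auto_accept"
```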
Read Original → via arXiv – CS AI