AI · Bearish · arXiv – CS AI · 6h ago · 7/10
An Enigma of Artificial Reason: Investigating the Production-Evaluation Gap in Large Reasoning Models
Researchers found that large reasoning models (LRMs) exhibit a significant production-evaluation gap: they score as low as 48% when evaluating flawed reasoning, despite near-perfect solution generation. Using the VAIR dataset, the study shows that LRMs suffer from answer confirmation bias (they verify conclusions rather than rigorously evaluate the reasoning steps), whereas humans perform similarly at both tasks.