AI · Neutral · arXiv – CS AI · 8h ago · 6/10
FormalRewardBench: A Benchmark for Formal Theorem Proving Reward Models
Researchers introduce FormalRewardBench, the first benchmark for evaluating reward models for formal theorem proving in Lean 4. The benchmark reveals that frontier LLMs like Claude Opus outperform specialized theorem provers at evaluating proof quality, suggesting that theorem-proving ability does not transfer to proof evaluation tasks.