FormalRewardBench: A Benchmark for Formal Theorem Proving Reward Models
Researchers introduce FormalRewardBench, the first benchmark for evaluating reward models in formal theorem proving using Lean 4. The benchmark reveals that frontier LLMs like Claude Opus outperform specialized theorem provers at evaluating proof quality, suggesting that theorem proving ability does not transfer to proof evaluation tasks.
FormalRewardBench addresses a critical bottleneck in AI-assisted formal mathematics: the evaluation of proof quality beyond simple binary correctness signals. Neural theorem provers currently rely on reinforcement learning with verifiable rewards (RLVR), which provides sparse feedback that fails to guide models through partial progress on difficult problems. The introduction of learned reward models offers potential solutions, but comparing these models previously required expensive and time-consuming RL training ablations.
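The contrast between sparse verifiable rewards and a learned reward signal can be made concrete with a minimal sketch. The function names and the step-fraction scoring below are illustrative assumptions, not the benchmark's or any prover's actual API:

```python
def rlvr_reward(proof_compiles: bool) -> float:
    """Sparse RLVR signal: 1.0 only if the Lean verifier accepts the
    whole proof, else 0.0. A proof that is 90% complete scores the
    same as an empty one, giving the policy no gradient toward it."""
    return 1.0 if proof_compiles else 0.0


def learned_reward(verified_steps: int, total_steps: int) -> float:
    """Hypothetical dense reward: partial credit for partial progress,
    approximated here as the fraction of proof steps that verify."""
    return verified_steps / total_steps if total_steps else 0.0
```

Under a sparse signal, a near-complete failed proof and a trivial failed proof are indistinguishable; a learned reward model is meant to separate them.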
The benchmark's design demonstrates careful research methodology, employing five expert-curated error injection strategies to generate meaningful preference pairs. This approach captures realistic failure modes rather than arbitrary mistakes, enabling more nuanced evaluation of reward model performance. The dataset of 250 preference pairs, while modest in size, reflects the high cost of expert curation in formal mathematics.
The counterintuitive finding that frontier general-purpose LLMs (59.8% accuracy) substantially outperform specialized theorem provers (24.4%) has significant implications for model development strategy. This result suggests that broader reasoning capabilities and instruction-following ability matter more for proof evaluation than domain-specific theorem proving expertise. It challenges assumptions that scaling specialized models is the optimal path for formal mathematics applications.
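The reported accuracies imply a pairwise metric: the fraction of preference pairs on which a model ranks the correct proof above the corrupted one. A minimal sketch, with function names assumed rather than taken from the benchmark:

```python
def pairwise_accuracy(pairs, score) -> float:
    """`pairs` is an iterable of (chosen, rejected) proof strings;
    `score` maps a proof to a scalar reward. Ties count as failures,
    since the model has not actually distinguished the pair."""
    pairs = list(pairs)
    if not pairs:
        return 0.0
    correct = sum(
        1 for chosen, rejected in pairs if score(chosen) > score(rejected)
    )
    return correct / len(pairs)
```

Note that a degenerate model assigning every proof the same score gets 0% under this convention, so the theorem provers' 24.4% reflects systematically inverted preferences on many pairs, not mere indifference.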
For the AI research community, FormalRewardBench establishes a standardized evaluation framework that enables rapid iteration on reward model designs without expensive RL training cycles, potentially shortening neural theorem proving development by months per iteration. The public release also democratizes access to formal mathematics benchmarks, historically concentrated at specialized institutions.
- Frontier LLMs achieve 59.8% accuracy at proof evaluation while specialized theorem provers achieve only 24.4%, indicating domain expertise does not transfer to reward modeling
- FormalRewardBench provides the first standardized benchmark for evaluating formal mathematics reward models, eliminating expensive RL training ablations
- Results across the five error injection strategies show that most error categories remain challenging, suggesting room for improvement in reward model robustness
- The benchmark's 250 preference pairs are expert-curated to reflect realistic proof failure modes in Lean 4
- Public release of FormalRewardBench democratizes formal mathematics research and accelerates neural theorem prover development cycles