TheoremBench: Evaluating LLMs on Theorem Proving in Formal Mathematics
Researchers introduce TheoremBench, a comprehensive Lean 4 benchmark for evaluating large language models on formal theorem proving. Unlike existing competition-focused benchmarks, TheoremBench tests how LLMs handle longer, dependency-rich proofs, presenting them both as standalone theorems and as structured families of related subtasks. The results show that current models remain inefficient and biased toward simpler problems.
TheoremBench addresses a gap in AI evaluation methodology by moving beyond the contest-style problem sets that dominate current benchmarking. Its dual-format design, which offers both standalone theorems and premise-expanded versions with supporting subtheorems, enables a finer-grained assessment of how language models approach complex mathematical reasoning. This matters because formal mathematics often involves long chains of dependencies, and models that excel at isolated problems may struggle with the multi-step reasoning required in real mathematical development.
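To make the dual format concrete, here is a toy Lean 4 sketch of the same statement in both forms. It is not taken from the benchmark: the theorem names (`target_standalone`, `step_drop_zero`, `step_swap`, `target_expanded`) are hypothetical, and real TheoremBench items are presumably far longer. The sketch only assumes core Lean 4 lemmas `Nat.add_zero` and `Nat.add_comm`.

```lean
-- Standalone form: the prover has to find the whole argument on its own.
theorem target_standalone (a b : Nat) : (a + b) + 0 = b + a := by
  rw [Nat.add_zero, Nat.add_comm a b]

-- Premise-expanded form: supporting subtheorems are stated up front
-- (hypothetical names, chosen for illustration only).
theorem step_drop_zero (x : Nat) : x + 0 = x := Nat.add_zero x

theorem step_swap (a b : Nat) : a + b = b + a := Nat.add_comm a b

-- The main theorem can now be proved by chaining the given premises.
theorem target_expanded (a b : Nat) : (a + b) + 0 = b + a := by
  rw [step_drop_zero, step_swap a b]
```

In the expanded form each subtheorem is also a proof obligation in its own right, which is what allows a benchmark of this shape to score the subtasks separately from the final theorem.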
The research reveals significant limitations in current theorem-proving LLMs. Despite recent improvements, these models demonstrate persistent biases toward simpler subproblems and generate unnecessarily verbose proof traces rather than elegant, efficient solutions. The introduction of coverage and token-efficiency metrics provides granular visibility into proof behavior patterns previously masked by binary success/failure evaluations. This finding suggests that raw performance improvements may obscure underlying inefficiencies in how models construct formal proofs.
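The summary does not give the exact definitions of the coverage and token-efficiency metrics, so the following Python sketch shows only one plausible reading. It assumes that coverage is the fraction of subtasks with at least one accepted proof and that token efficiency is the number of generated tokens per solved subtask. The `Attempt` record and both function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Attempt:
    subtask: str   # hypothetical subtask identifier
    proved: bool   # True if the proof checker accepted this attempt
    tokens: int    # tokens the model generated for this attempt


def coverage(attempts):
    """Fraction of distinct subtasks with at least one accepted proof (assumed definition)."""
    subtasks = {a.subtask for a in attempts}
    if not subtasks:
        return 0.0
    solved = {a.subtask for a in attempts if a.proved}
    return len(solved) / len(subtasks)


def tokens_per_solved(attempts):
    """Total tokens spent, divided by the number of distinct solved subtasks (assumed definition)."""
    solved = {a.subtask for a in attempts if a.proved}
    if not solved:
        return float("inf")  # nothing solved: efficiency is unbounded
    return sum(a.tokens for a in attempts) / len(solved)


# Made-up attempts, purely for illustration.
attempts = [
    Attempt("lemma_a", True, 400),
    Attempt("lemma_b", False, 900),
    Attempt("lemma_b", True, 700),
    Attempt("main", False, 2000),
]
print(coverage(attempts))           # 2 of 3 subtasks solved
print(tokens_per_solved(attempts))  # 4000 tokens over 2 solved subtasks
```

Under a binary success metric this run simply fails, because `main` is unproved. The two numbers above instead show how much of the family was solved and at what token cost, which is the kind of detail the article says the new metrics expose.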
For the AI development community, TheoremBench's approach signals an important methodological shift toward more realistic, structure-aware evaluation frameworks. The benchmark's emphasis on dependency-rich problems closer to actual mathematical research work creates better feedback signals for model improvement. The observation that explicit premises substantially boost performance also indicates that context engineering and proof decomposition strategies deserve greater research attention. Looking forward, the field should monitor whether future models can overcome the identified biases and generate more efficient proofs, as these improvements would represent meaningful progress toward genuine mathematical reasoning capabilities rather than surface-level benchmark gains.
- TheoremBench evaluates LLMs on realistic, dependency-rich mathematical theorems rather than isolated competition problems
- Current theorem-proving models show strong biases toward easier subproblems and generate inefficient proof traces
- Providing explicit premises substantially improves performance, suggesting proof decomposition is a valuable strategy
- New coverage and token-efficiency metrics reveal qualitative differences in proof behavior that traditional success metrics mask
- The benchmark's dependency structure makes it possible to measure partial progress toward a theorem, not just final success
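One way to picture partial progress through proof dependencies is as a score over a dependency graph. The Python sketch below is an assumption, not the benchmark's actual scoring rule: it treats a theorem family as a mapping from each theorem to the subtheorems it depends on, and reports the fraction of the target's dependency closure that has been proved. The names `transitive_deps`, `partial_progress`, and the example graph are hypothetical.

```python
def transitive_deps(target, deps):
    """Return every theorem the target depends on, directly or indirectly."""
    seen = set()
    stack = list(deps.get(target, []))
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.add(node)
            stack.extend(deps.get(node, []))
    return seen


def partial_progress(target, deps, proved):
    """Fraction of the target and its dependencies that are proved (assumed scoring rule)."""
    needed = transitive_deps(target, deps) | {target}
    return len(needed & proved) / len(needed)


# Hypothetical family: main depends on lemma_a and lemma_b; lemma_b depends on lemma_c.
deps = {"main": ["lemma_a", "lemma_b"], "lemma_b": ["lemma_c"]}
proved = {"lemma_a", "lemma_c"}
print(partial_progress("main", deps, proved))  # 2 of 4 required theorems proved
```

A model that proves `lemma_a` and `lemma_c` but not `main` would score zero under a final-success metric, while this kind of score records that half of the required results are in place.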