arXiv – CS AI · 6h ago
LemmaBench: A Live, Research-Level Benchmark to Evaluate LLM Capabilities in Mathematics

Researchers have developed LemmaBench, a benchmark that evaluates large language models on research-level mathematics by automatically extracting and rewriting lemmas from arXiv papers. Current state-of-the-art LLMs achieve only 10–15% accuracy on these theorem-proving tasks, revealing a significant gap between AI capabilities and human-level mathematical research.