arXiv · CS AI · 4h ago
LemmaBench: A Live, Research-Level Benchmark to Evaluate LLM Capabilities in Mathematics
Researchers have developed LemmaBench, a new benchmark that evaluates large language models (LLMs) on research-level mathematics by automatically extracting and rewriting lemmas from arXiv papers. Current state-of-the-art LLMs achieve only 10-15% accuracy on these theorem-proving tasks, revealing a significant gap between AI capabilities and human-level mathematical research.