y0news
#mathematics
2 articles
AI · Neutral · arXiv – CS AI · 6h ago · 8

LemmaBench: A Live, Research-Level Benchmark to Evaluate LLM Capabilities in Mathematics

Researchers have developed LemmaBench, a new benchmark for evaluating Large Language Models on research-level mathematics by automatically extracting and rewriting lemmas from arXiv papers. Current state-of-the-art LLMs achieve only 10-15% accuracy on these mathematical theorem proving tasks, revealing a significant gap between AI capabilities and human-level mathematical research.

AI · Bullish · arXiv – CS AI · 6h ago · 5

Aletheia tackles FirstProof autonomously

Aletheia, a mathematics research agent powered by Gemini 3 Deep Think, solved 6 out of 10 problems in the inaugural FirstProof challenge. The system worked autonomously, and expert assessments confirmed its solutions, although there was some disagreement on Problem 8.