AI · Neutral · arXiv – CS AI · 10h ago · 6/10
Re$^2$Math: Benchmarking Theorem Retrieval in Research-Level Mathematics
Researchers introduce Re²Math, a new benchmark for evaluating how well large language models retrieve relevant theorems and lemmas from the academic literature during proof construction. The benchmark reveals significant gaps in current AI systems: the best model achieves only 7.0% accuracy despite retrieving valid statements, which suggests that current systems struggle to verify whether a retrieved statement applies to the specific proof context.