Re²Math: Benchmarking Theorem Retrieval in Research-Level Mathematics
Researchers introduce Re²Math, a new benchmark for evaluating large language models' ability to retrieve relevant mathematical theorems and lemmas from academic literature during proof construction. The benchmark reveals significant gaps in current AI systems: the best model achieves only 7.0% accuracy even though it retrieves valid statements, indicating that models struggle to verify whether a retrieved theorem actually applies to the proof context at hand.
Re²Math addresses a critical limitation in AI-assisted mathematical research: while large language models excel at closed-world reasoning tasks, they struggle to ground their reasoning in actual scholarly literature and to verify that retrieved sources apply to specific proof steps. The benchmark turns this nebulous research challenge into a measurable diagnostic task by decomposing it into three components: citation recall, source grounding, and proof-gap sufficiency.

The low 7.0% accuracy despite higher source-grounding rates exposes a fundamental gap between retrieving mathematically valid theorems and establishing their relevance to a local proof context. This matters because research mathematics depends on building upon established results: an AI assistant must not only know that a theorem exists but also determine whether its hypotheses and side conditions align with the work at hand.

The benchmark's design, which uses frozen retrieval artifacts and supports continual expansion, enables reproducible evaluation while remaining flexible as new mathematical results emerge. For the AI research community, this work benchmarks a capability that practical research-level mathematical assistance requires. The findings suggest that current systems need better mechanisms for contextual reasoning about applicability, not just stronger information retrieval, which creates opportunities for approaches to mathematical tool use that combine retrieval with deeper verification of relevance. As AI increasingly enters academic workflows, rigorous benchmarks for literature-grounded reasoning become essential for reliability and trustworthiness in high-stakes domains like mathematics research.
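To make that decomposition concrete, here is a minimal sketch of how a staged scorer for the three components might look. The data fields, the exact-match grounding check, and the `score` function are illustrative assumptions, not the benchmark's actual implementation:

```python
from dataclasses import dataclass

@dataclass
class Prediction:
    cited_ids: set[str]                 # theorem identifiers the model cites
    quoted_statements: dict[str, str]   # id -> statement text attributed to the source
    claims_sufficient: bool             # model asserts the citation closes the proof gap

@dataclass
class GoldInstance:
    gold_ids: set[str]        # reference theorems relevant to this proof step
    corpus: dict[str, str]    # frozen snapshot: id -> canonical statement
    gap_closed_by: set[str]   # subset of gold_ids that actually closes the gap

def score(pred: Prediction, gold: GoldInstance) -> dict[str, float]:
    # Stage 1: citation recall — did the model cite any reference theorem?
    recall = len(pred.cited_ids & gold.gold_ids) / max(len(gold.gold_ids), 1)

    # Stage 2: source grounding — do quoted statements match the frozen corpus?
    grounded = [tid for tid, text in pred.quoted_statements.items()
                if gold.corpus.get(tid, "").strip() == text.strip()]
    grounding = len(grounded) / max(len(pred.quoted_statements), 1)

    # Stage 3: proof-gap sufficiency — does a grounded citation close the gap,
    # and does the model correctly claim that it does?
    sufficiency = float(bool(set(grounded) & gold.gap_closed_by)
                        and pred.claims_sufficient)

    return {"citation_recall": recall,
            "source_grounding": grounding,
            "sufficiency": sufficiency}
```

Staging the metrics this way is what lets a high grounding score coexist with a near-zero sufficiency score, which is exactly the failure pattern the paper reports.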
- Current best-performing models achieve only 7.0% accuracy on the Re²Math benchmark despite successfully retrieving valid mathematical statements
- The gap between source-grounding rates and applicability verification shows that models fail at contextual reasoning about a theorem's relevance to a given proof step
- Re²Math's modular evaluation decouples citation recall, source grounding, and proof-gap sufficiency into separately measurable diagnostic components
- Frozen retrieval artifacts ensure reproducibility while allowing continual benchmark expansion with newly constructed instances (see the sketch after this list)
- The benchmark identifies a critical capability needed for practical AI-assisted mathematical research before deployment
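On the reproducibility point, a common way to freeze retrieval artifacts is to pin each corpus snapshot by checksum so every run retrieves against byte-identical data while new instances ship as separately hashed shards. The file layout, JSON format, and function below are hypothetical, sketching the idea rather than Re²Math's actual tooling:

```python
import hashlib
import json
from pathlib import Path

def load_frozen_corpus(path: str, expected_sha256: str) -> dict[str, str]:
    """Load a frozen theorem corpus and refuse to run if it has drifted.

    Pinning the snapshot by hash keeps evaluation runs byte-identical;
    newly constructed instances go into new, separately hashed shards.
    """
    data = Path(path).read_bytes()
    digest = hashlib.sha256(data).hexdigest()
    if digest != expected_sha256:
        raise ValueError(f"Corpus snapshot changed: got {digest}, "
                         f"expected {expected_sha256}")
    return json.loads(data)  # id -> theorem statement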