Evaluation of LLMs for Mathematical Formalization in Lean
Researchers compared large language models' ability to generate formal mathematical proofs in Lean 4. Gemini 3.1 Pro and Claude Opus 4.7 achieved the highest success rates (92% and 86%, respectively), while NVIDIA Nemotron 3 Super and GPT-OSS 120B offered the best cost-efficiency at under $0.01 per correct proof.
This research addresses a critical gap in understanding which LLMs perform best for mathematical formalization, a specialized task requiring both deep reasoning and precise formal syntax. The study's comparison across multiple models using rigorous metrics (pass@k and refine@k) provides practical guidance for researchers and developers integrating LLMs into formal verification workflows. The dramatic improvement in LLM performance on formal proofs over recent years signals the maturation of these models for domain-specific technical applications beyond natural language tasks.
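To make the pass@k metric concrete, here is a minimal sketch of the widely used unbiased estimator: given n sampled proofs for a problem, of which c are accepted by the checker, it estimates the probability that at least one of k samples is correct. The function name and the example counts are illustrative assumptions, not taken from the study. The summary does not define refine@k, so it is not modeled here.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for one problem.

    n: total proofs sampled, c: proofs that verified, k: sample budget.
    Computes 1 - C(n - c, k) / C(n, k).
    """
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    # If fewer than k samples failed, any draw of k must contain a success.
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


# Hypothetical counts for a single problem: 10 samples, 5 verified.
estimate = pass_at_k(n=10, c=5, k=1)
```

A benchmark-level score is typically the mean of this per-problem estimate across all problems.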
Mathematical formalization in proof assistants like Lean has historically been a bottleneck limiting the adoption of formal verification. As LLMs demonstrate capability in this area, they lower the barriers to integrating formal methods into software development and mathematical research, potentially accelerating verification practices in critical systems. The study evaluates established models alongside emerging alternatives, establishing baseline performance metrics for the field.
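For readers unfamiliar with Lean, the following is a small illustration of what a formalized statement and its proof look like in Lean 4. It is a deliberately simple example chosen for this summary, not a problem from the study's benchmark; the theorem name is hypothetical.

```lean
-- Statement: addition of natural numbers is commutative.
-- The proof cites the core library lemma `Nat.add_comm`.
theorem add_comm_example (a b : Nat) : a + b = b + a := by
  exact Nat.add_comm a b
```

A generated proof counts as correct only if Lean's checker accepts it, which is why the task demands exact syntax as well as sound reasoning.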
For the AI development community, these results highlight the performance trade-offs between frontier models and cost-efficient alternatives. Organizations face a clear choice: invest in premium models like Gemini or Claude for maximum accuracy on mathematical tasks, or deploy open-source models where budgets are constrained and somewhat lower accuracy is acceptable. These trade-offs matter for institutions building AI-assisted theorem proving systems and formal verification tools, where proof generation speed and cost directly affect research productivity.
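The cost-efficiency figure can be read as total spend divided by the number of proofs that verified. The sketch below assumes that definition, which the summary does not spell out; the function name and the dollar and problem counts are hypothetical placeholders, not results from the study.

```python
def cost_per_correct_proof(total_cost_usd: float, proofs_verified: int) -> float:
    """Total API or compute spend divided by the number of verified proofs.

    Returns infinity when nothing verified, so that a model solving no
    problems never looks cheap.
    """
    if total_cost_usd < 0:
        raise ValueError("total_cost_usd must be non-negative")
    if proofs_verified <= 0:
        return float("inf")
    return total_cost_usd / proofs_verified


# Hypothetical run: $2.00 spent in total, 200 proofs accepted by the checker.
example_cost = cost_per_correct_proof(total_cost_usd=2.00, proofs_verified=200)
```

Dividing by verified proofs rather than by attempts penalizes models that are cheap per call but rarely succeed.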
The study should prompt continued evaluation as new LLM versions are released and open-source models improve. Researchers should monitor whether specialized fine-tuning on mathematical datasets can close the performance gap, potentially making cost-efficient models more competitive for this specialized application.
- Gemini 3.1 Pro leads with 92% success on the miniF2F dataset using the refine@32 metric
- Cost-efficient open-source models achieve competitive accuracy at under $0.01 per proof
- LLMs demonstrate significant capability improvement for formal mathematical proof generation
- Frontier and budget-conscious solutions offer distinct trade-offs for different deployment scenarios
- Formal proof benchmarking reveals an emerging competitive landscape among diverse model architectures