AI · Bullish · arXiv – CS AI · 2d ago · 7/10
🧠Researchers introduce proof-state snapshotting, a technique that accelerates automated theorem proving in Lean 4 by reusing elaborated proof states across parallel search branches instead of reconstructing them. The method achieves 5.6-50x speedups (averaging 14x) on benchmark problems, addressing a critical bottleneck where per-branch overhead from import loading and elaboration consumed over 99% of computation time.
AI · Bullish · arXiv – CS AI · May 9 · 7/10
🧠Researchers have introduced the AI co-mathematician, an interactive workbench that leverages agentic AI to assist mathematicians in solving open-ended research problems. The system achieves state-of-the-art results on hard benchmarks, scoring 48% on FrontierMath Tier 4, and demonstrates practical value by helping researchers solve open problems and identify new research directions.
AI · Bullish · arXiv – CS AI · Mar 9 · 7/10
🧠Researchers propose a new method for training large language models (LLMs) that addresses the diversity loss problem in reinforcement learning approaches. Their technique uses the α-divergence family to better balance precision and diversity in reasoning tasks, achieving state-of-the-art performance on theorem-proving benchmarks.
AI · Bullish · arXiv – CS AI · Mar 5 · 7/10
🧠Researchers have developed LeanTutor, a proof-of-concept AI system that combines Large Language Models with theorem provers to create a mathematically verified proof tutor. The system features three modules for autoformalization, proof-checking, and natural language feedback, evaluated using PeanoBench, a new dataset of 371 Peano Arithmetic proofs.
AI · Neutral · arXiv – CS AI · Mar 4 · 7/10 · 4
🧠Researchers have introduced SorryDB, a dynamic benchmark for evaluating AI systems' ability to prove mathematical theorems using the Lean proof assistant. The benchmark draws from 78 real-world formalization projects and addresses limitations of static benchmarks by providing continuously updated tasks that better reflect community needs.
AI · Bullish · arXiv – CS AI · Mar 3 · 7/10 · 3
🧠Researchers introduce GAR (Generative Adversarial Reinforcement Learning), a new AI training framework that jointly trains problem generators and solvers in an adversarial loop for formal theorem proving. The method shows significant improvements in mathematical proof capabilities, with models achieving 4.20% average relative improvement on benchmark tests.
AI · Neutral · arXiv – CS AI · Feb 27 · 7/10 · 7
🧠Researchers introduced LeanCat, a benchmark comprising 100 category-theory tasks in Lean to test AI's formal theorem proving capabilities. State-of-the-art models achieved only 12% success rates, revealing significant limitations in abstract mathematical reasoning, while a new retrieval-augmented approach doubled performance to 24%.
AI · Bullish · OpenAI News · Feb 2 · 7/10 · 5
🧠Researchers have developed a neural theorem prover for Lean that successfully solved challenging high-school mathematics olympiad problems, including those from AMC12, AIME competitions, and two problems adapted from the International Mathematical Olympiad (IMO). This represents a significant advancement in AI's ability to handle formal mathematical reasoning and proof generation.
AI · Bullish · arXiv – CS AI · 2d ago · 6/10
🧠Researchers introduce Hilbert-Geo, a neural-symbolic AI framework for solving solid geometry problems by combining formal language representation with theorem-based reasoning. The system achieves 77.3% accuracy on solid geometry tasks, significantly outperforming leading AI models like GPT-4 and Gemini-2.5-pro, demonstrating advances in multimodal geometric reasoning.
AI · Neutral · arXiv – CS AI · May 12 · 6/10
🧠Researchers introduce FormalRewardBench, the first benchmark for evaluating reward models in formal theorem proving using Lean 4. The benchmark reveals that frontier LLMs like Claude Opus outperform specialized theorem provers at evaluating proof quality, suggesting that theorem proving ability does not transfer to proof evaluation tasks.
AI · Neutral · arXiv – CS AI · May 12 · 6/10
🧠Researchers introduce VeriContest, a benchmark of 946 competitive-programming problems designed to evaluate AI models' ability to generate not just functional code but also formal specifications and machine-checkable proofs. Testing ten state-of-the-art models reveals a dramatic capability gap: while the strongest model achieves 92% accuracy on code generation alone, performance plummets to 48% on specifications, 14% on proofs, and just 5% end-to-end, identifying proof generation as the critical bottleneck for verifiable code generation systems.
AI · Neutral · arXiv – CS AI · May 11 · 6/10
🧠Researchers present a formal framework for recursive reasoning systems that addresses two critical design challenges: how to represent evolving reasoning states and when to terminate iteration. The paper introduces an epistemic state graph representation and proposes the 'order-gap' metric as a stopping criterion, with theoretical guarantees for when this criterion provides meaningful guidance.
AI · Neutral · arXiv – CS AI · Apr 20 · 6/10
🧠Researchers propose DeepInsightTheorem, a framework that teaches large language models to improve informal theorem proving by explicitly extracting and learning core mathematical techniques. The hierarchical dataset combined with a multi-stage training strategy enables LLMs to perform more insightful mathematical reasoning, outperforming existing baseline approaches on challenging benchmarks.
AI · Neutral · arXiv – CS AI · Apr 10 · 6/10
🧠Researchers present ProofSketcher, a hybrid system combining large language models with lightweight proof verification to address mathematical reasoning errors in AI-generated proofs. The approach bridges the gap between LLM efficiency and the formal rigor of interactive theorem provers like Lean and Coq, enabling more reliable automated reasoning without requiring full formalization.
AI · Neutral · arXiv – CS AI · Mar 5 · 5/10
🧠A research paper discusses how AI systems are now capable of proving research-level mathematical theorems both formally and informally. The paper advocates for mathematicians to adapt to this technological disruption and consider both the challenges and opportunities it presents for mathematical practice.
AI · Neutral · arXiv – CS AI · Mar 2 · 7/10 · 20
🧠Researchers have developed LemmaBench, a new benchmark for evaluating Large Language Models on research-level mathematics by automatically extracting and rewriting lemmas from arXiv papers. Current state-of-the-art LLMs achieve only 10-15% accuracy on these mathematical theorem proving tasks, revealing a significant gap between AI capabilities and human-level mathematical research.
AI · Bullish · Synced Review · Apr 30 · 6/10 · 6
🧠DeepSeek AI has released DeepSeek-Prover-V2, an open-source large language model designed specifically for Lean 4 theorem proving. The model uses a recursive proof search and relies on DeepSeek-V3 to generate training data for reinforcement learning, achieving top results on the MiniF2F benchmark.
AI · Bullish · OpenAI News · Sep 7 · 6/10 · 5
🧠The article discusses the application of generative language models to automated theorem proving, representing an advancement in AI's ability to generate mathematical proofs. This development could enhance AI systems' reasoning capabilities and formal verification processes.
AI · Neutral · OpenAI News · Jun 2 · 4/10 · 6
🧠GamePad is introduced as a learning environment for theorem proving. It is intended as a setting in which machine learning methods can be applied to the development and checking of formal proofs in the Coq proof assistant.