arXiv – CS AI · 6h ago
LemmaBench: A Live, Research-Level Benchmark to Evaluate LLM Capabilities in Mathematics

Researchers have developed LemmaBench, a benchmark that evaluates large language models on research-level mathematics by automatically extracting and rewriting lemmas from arXiv papers. Current state-of-the-art LLMs achieve only 10–15% accuracy on these theorem-proving tasks, revealing a significant gap between AI capabilities and human-level mathematical research.