🧠 AI|⚪ Neutral|Importance 6/10

Risk-Controlled Lean-as-Judge for Natural-Language Mathematical Reasoning

arXiv – CS AI|Pauline Bourigault, Xiaotong Ji, Matthieu Zimmer, Rasul Tutunov, Haitham Bou Ammar|May 28, 2026 at 04:00 AM

🤖 AI Summary

Researchers show that a successful Lean proof is not a uniformly reliable check on natural-language mathematical reasoning: accuracy ranges from 96% where coverage is high to 20% where it is low. They introduce COVCAL, a risk-control method that certifies when partial formal signals can be trusted, and show that its feasibility depends critically on autoformalization quality and coverage.

Analysis

This paper addresses a fundamental challenge in AI-assisted mathematical reasoning: formal verification tools like Lean provide incomplete and sometimes misleading validation signals. The researchers found that proof success in Lean correlates strongly with answer correctness only when coverage is high, yet coverage remains sparse with current autoformalization models. A 7B-parameter formalizer achieves only 28% coverage, and a manual audit found that approximately 43% of proved statements do not actually correspond to correct answers, which points to systematic faithfulness problems in the formalization step.
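To make the faithfulness problem concrete, here is a small Lean 4 illustration. The problem and both formalizations are hypothetical and do not come from the paper; they only show how a statement can be proved in Lean without certifying the natural-language answer.

```lean
-- Hypothetical problem (not from the paper):
--   "Find the smallest positive integer n with n^2 > 50."
-- The correct answer is 8. Suppose a model answers 9.

-- Unfaithful formalization: it checks only the inequality, so the
-- wrong answer 9 is still "proved".
example : 9 ^ 2 > 50 := by decide

-- A more faithful formalization for the answer 8 also rules out the
-- next smaller candidate (enough here because squaring is monotone
-- on the natural numbers).
example : 8 ^ 2 > 50 ∧ ¬ (7 ^ 2 > 50) := by decide
```

The same faithful template applied to 9 would require `¬ (8 ^ 2 > 50)`, which is false, so the wrong answer could no longer be proved. A proof of the first statement therefore says little about whether the answer is right.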

The work builds on growing interest in using formal verification to improve AI reasoning systems, particularly as mathematical problem-solving becomes a benchmark for evaluating advanced language models. Previous approaches assumed that successfully formalized and proved statements reliably indicate correct answers, but this research exposes the nuance required for trustworthy integration of formal methods into AI evaluation pipelines.

The COVCAL framework introduces selective-risk bounds using either conservative Bonferroni corrections or tighter calibration rules, allowing systems to abstain when confidence is insufficient rather than making unreliable claims. Results show dramatic differences between formalizers: the generic 7B model's sparse signal makes risk-controlled acceptance infeasible, while a specialized prover-tuned formalizer reaching 79% coverage enables acceptance of approximately 48% of problems at 98% accuracy. This suggests that formalizer architecture and specialization significantly impact the viability of formal verification as a trustworthy signal.
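The idea of accepting only when a certified error bound holds can be sketched in a few lines of Python. This is a generic selective-risk calibration using a Hoeffding bound with a Bonferroni correction over candidate thresholds, not the paper's COVCAL procedure; the function names, the score input, and the parameters `alpha` and `delta` are all assumptions made for illustration.

```python
import math
from typing import Optional, Sequence


def hoeffding_upper_bound(errors: int, n: int, delta: float) -> float:
    """Upper confidence bound on an error rate estimated from n samples.

    Holds with probability at least 1 - delta (Hoeffding's inequality).
    """
    if n == 0:
        return 1.0  # no accepted samples: nothing can be certified
    return min(1.0, errors / n + math.sqrt(math.log(1.0 / delta) / (2.0 * n)))


def calibrate_threshold(
    scores: Sequence[float],
    correct: Sequence[bool],
    thresholds: Sequence[float],
    alpha: float,
    delta: float,
) -> Optional[float]:
    """Pick the lowest threshold whose certified selective risk is <= alpha.

    scores:     hypothetical per-problem confidence derived from the formal signal
    correct:    whether the answer was right on a labeled calibration set
    thresholds: candidate acceptance thresholds (accept when score >= t)
    Returns None when no threshold is feasible, i.e. the system should abstain.
    """
    # Bonferroni: split the failure budget evenly across all tested thresholds.
    per_test_delta = delta / len(thresholds)
    feasible = []
    for t in thresholds:
        accepted = [c for s, c in zip(scores, correct) if s >= t]
        errors = sum(1 for c in accepted if not c)
        bound = hoeffding_upper_bound(errors, len(accepted), per_test_delta)
        if accepted and bound <= alpha:
            feasible.append(t)
    # The lowest feasible threshold accepts the most problems.
    return min(feasible) if feasible else None
```

With only a handful of accepted calibration examples, the bound stays wide even if every accepted answer is correct, so the function returns `None`. That mirrors, in simplified form, why a sparse signal can make risk-controlled acceptance infeasible.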

The findings also note that self-consistency baselines already achieve 91% accuracy, which sets the bar a formal signal must clear to add value. For developers building AI mathematical reasoning systems, the research clarifies that formal verification cannot be treated as trustworthy by default: its signal quality depends on coverage and has to be measured and controlled.
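For reference, a self-consistency baseline is commonly implemented as a majority vote over several sampled answers. The summary does not say how the paper's baseline was computed, so the sketch below assumes plain majority voting; the function name and the `min_agreement` parameter are hypothetical.

```python
from collections import Counter
from typing import Optional, Sequence, Tuple


def self_consistency(
    answers: Sequence[str], min_agreement: float = 0.0
) -> Tuple[Optional[str], float]:
    """Return the most common answer and the fraction of samples that agree.

    If agreement falls below min_agreement, return None as the answer to
    signal abstention.
    """
    if not answers:
        return None, 0.0
    answer, votes = Counter(answers).most_common(1)[0]
    agreement = votes / len(answers)
    if agreement < min_agreement:
        return None, agreement
    return answer, agreement
```

For example, `self_consistency(["42", "42", "41", "42"])` selects `"42"` with an agreement of 0.75.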

Key Takeaways

→The accuracy of Lean proof success as a correctness signal ranges from 96% to 20% depending on formalization coverage, so its reliability is highly context-dependent.
→Current autoformalization models suffer from both low coverage (28% for a 7B model) and faithfulness issues: in a manual audit, approximately 43% of proved statements did not correspond to correct answers.
→COVCAL's risk-controlled framework enables selective acceptance with certified error bounds, abstaining rather than making unreliable claims when coverage is insufficient.
→Formalizer specialization is critical: a prover-tuned model reaches 79% coverage, which makes risk-controlled acceptance feasible, while for the generic 7B model it remains infeasible on all tested partitions.
→Self-consistency baselines achieving 91% accuracy establish a high bar that formal verification must exceed to provide meaningful additional validation signal.