VERITAS: Verifier-Guided Proof Search for Zero-Shot Formal Theorem Proving
VERITAS introduces a zero-shot framework for formal theorem proving that uses rich verifier feedback signals rather than binary pass/fail outcomes. Using a two-phase approach that combines Best-of-N sampling with critic-guided Monte Carlo Tree Search (MCTS), the system achieves 40.6% accuracy on the miniF2F benchmark and is particularly strong on combinatorial problems where iterative lemma recovery is critical.
VERITAS addresses a fundamental inefficiency in LLM-based formal provers: they discard nuanced verifier information. Traditional systems reduce syntax errors, type mismatches, and partial goal progress to a single binary signal, losing valuable guidance for proof search. The framework's two-phase protocol recovers this information by first running a Best-of-N sampling phase, then feeding the failures as explicit negative examples into a critic-guided MCTS exploration phase. Because the second phase works from the first phase's failures, every theorem already solved in Phase 1 stays solved, and the additional search effort goes to feedback-driven discovery.
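The two-phase protocol described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: the `Feedback` schema, the function names, and the toy verifier are all hypothetical, and Phase 2 uses a plain best-first search ordered by a critic score as a simple stand-in for critic-guided MCTS.

```python
import heapq
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple

@dataclass(frozen=True)
class Feedback:
    """Structured verifier result (hypothetical schema, not the paper's)."""
    ok: bool
    error: str = ""           # e.g. which step failed and why
    goals_remaining: int = 0  # open goals when checking stopped

Verifier = Callable[[str], Feedback]

def best_of_n(candidates: List[str], verify: Verifier):
    """Phase 1: check each sampled proof, keeping failures with their feedback."""
    failures: List[Tuple[str, Feedback]] = []
    for proof in candidates:
        fb = verify(proof)
        if fb.ok:
            return proof, failures
        failures.append((proof, fb))
    return None, failures

def guided_search(failures, verify, refine, critic, budget=20) -> Optional[str]:
    """Phase 2: best-first search seeded with Phase 1 failures.
    Stand-in for critic-guided MCTS; higher critic score is expanded first."""
    frontier, seen, tick = [], set(), 0
    for proof, fb in failures:
        if proof not in seen:
            seen.add(proof)
            heapq.heappush(frontier, (-critic(fb), tick, proof, fb))
            tick += 1
    while frontier:
        _, _, proof, fb = heapq.heappop(frontier)
        for child in refine(proof, fb):
            if child in seen:
                continue
            if budget <= 0:
                return None  # verifier-call budget exhausted
            budget -= 1
            seen.add(child)
            child_fb = verify(child)
            if child_fb.ok:
                return child
            heapq.heappush(frontier, (-critic(child_fb), tick, child, child_fb))
            tick += 1
    return None

def prove(candidates, verify, refine, critic, budget=20) -> Optional[str]:
    proof, failures = best_of_n(candidates, verify)
    if proof is not None:
        return proof  # a Phase 1 solution is returned untouched
    return guided_search(failures, verify, refine, critic, budget)

# --- Toy setup (entirely hypothetical): a proof is a sequence of lemma names.
LEMMAS = ["add_comm", "mul_comm", "add_assoc"]
TARGET = ["add_comm", "mul_comm"]

def toy_verify(proof: str) -> Feedback:
    steps = proof.split()
    for i, want in enumerate(TARGET):
        if i >= len(steps):
            return Feedback(False, "unsolved goals", len(TARGET) - i)
        if steps[i] != want:
            return Feedback(False, f"step {i}: '{steps[i]}' fails", len(TARGET) - i)
    return Feedback(True)

def toy_refine(proof: str, fb: Feedback) -> List[str]:
    """Replace the first failing step with each known lemma name."""
    steps = proof.split()
    i = len(TARGET) - fb.goals_remaining  # index of the first failing step
    return [" ".join(steps[:i] + [name] + steps[i + 1:]) for name in LEMMAS]

def toy_critic(fb: Feedback) -> float:
    return -fb.goals_remaining  # fewer open goals scores higher

samples = ["addcomm mulcomm", "add_com mul_comm"]  # both have misspelled lemmas
result = prove(samples, toy_verify, toy_refine, toy_critic)
print(result)  # add_comm mul_comm
```

In this toy run both sampled proofs fail in Phase 1, and Phase 2 repairs one lemma name at a time using the failing-step position reported by the verifier. A binary pass/fail signal would give the search no way to tell which step to change.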
The 40.6% miniF2F performance represents meaningful progress over comparable baselines, but the most revealing results emerge from VERITAS-CombiBench, a newly released 55-theorem combinatorics benchmark. Here, Best-of-5 sampling collapses to 1.8% accuracy, while Portfolio methods achieve 3.6%, counterintuitively outperforming unguided sampling. VERITAS reaches 7.3%, indicating that when correct solutions require iteratively recovering lemma names from verifier feedback, structured exploration beats brute-force sampling. This insight has implications for how AI systems should integrate verification signals in domains where stepwise refinement matters more than raw compute.
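Because the benchmark has only 55 theorems, it helps to translate the reported percentages back into solved-theorem counts. The short check below does this arithmetic; the counts are inferred from the rounded percentages and are not figures stated in the source, and `implied_count` is a helper introduced here for illustration.

```python
BENCH_SIZE = 55  # theorems in VERITAS-CombiBench, per the text
reported = {"Best-of-5": 1.8, "Portfolio": 3.6, "VERITAS": 7.3}  # accuracy in %

def implied_count(pct: float, total: int):
    """Return the count k such that 100*k/total rounds to pct (one decimal)."""
    for k in range(total + 1):
        if abs(100 * k / total - pct) < 0.05:
            return k
    return None  # no integer count matches the reported percentage

counts = {name: implied_count(p, BENCH_SIZE) for name, p in reported.items()}
print(counts)  # {'Best-of-5': 1, 'Portfolio': 2, 'VERITAS': 4}
```

Under this reading the three methods solve 1, 2, and 4 theorems respectively, so the ordering is clear but the absolute differences amount to a handful of problems.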
For the broader AI research community, VERITAS validates that verifier feedback mechanisms deserve first-class treatment in proof search rather than downstream filtering. The release of VERITAS-CombiBench with open artifacts supports reproducibility and establishes a more challenging evaluation surface. The work exemplifies how reasoning systems benefit from treating error signals as exploration guides rather than termination conditions.
- VERITAS achieves 40.6% on miniF2F by routing rich verifier signals into proof search instead of collapsing them into a binary pass/fail outcome
- The two-phase protocol (Best-of-N sampling, then critic-guided MCTS) preserves Phase 1 solutions while enabling feedback-driven exploration in Phase 2
- VERITAS-CombiBench shows that unguided sampling underperforms structured feedback methods when iterative lemma recovery is required
- Best-of-5 sampling achieves only 1.8% on VERITAS-CombiBench while VERITAS reaches 7.3%, exposing the limits of sampling alone
- The framework demonstrates that rich verifier information has untapped value for improving formal reasoning in AI systems