AlgoVeri: An Aligned Benchmark for Verified Code Generation on Classical Algorithms
Researchers introduced AlgoVeri, a unified benchmark for evaluating AI-generated formally verified code across three major verification systems (Dafny, Verus, and Lean). The benchmark reveals significant performance disparities depending on the verification language, with frontier AI models achieving 40.3% success in Dafny but only 7.8% in Lean, highlighting fundamental challenges in cross-paradigm code verification.
AlgoVeri addresses a critical gap in AI evaluation methodology by creating the first standardized benchmark that directly compares verified code generation across multiple formal verification systems. Previous benchmarks operated in isolation with incomparable metrics, preventing meaningful assessment of AI capabilities in this specialized domain. This research matters because formally verified code is increasingly important for security-critical applications, and understanding where AI excels or fails in this space informs both tool development and realistic deployment expectations.
The stark performance differences across verification languages reflect tradeoffs in how each system is designed. Dafny's high-level abstractions and SMT automation allow models to focus on logical correctness, enabling competitive performance from frontier models like Gemini-3 Flash. Verus introduces systems-level memory constraints that significantly reduce success rates, while Lean's demand for explicit proof construction creates the steepest barrier. These findings suggest that verification system design, and not model capability alone, constrains AI effectiveness.
The divergent test-time compute dynamics between models carry important implications for AI development strategies. Gemini-3's effective use of iterative repair to triple Dafny pass rates indicates that the model's architecture supports refinement-based problem-solving, whereas GPT-OSS saturates early without similar gains. This suggests future verification-focused models should prioritize iterative improvement mechanisms.
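To make the idea of iterative repair concrete, here is a minimal sketch of a generate-verify-repair loop. It is not AlgoVeri's actual harness: the `generate` and `verify` callables, the `max_rounds` budget, and the `pass_rate` helper are all hypothetical names introduced for illustration. The sketch assumes the verifier returns a success flag plus an error message, and that the message is fed into the next generation attempt.

```python
from typing import Callable, Optional, Sequence, Tuple

# Hypothetical interfaces (assumptions, not taken from AlgoVeri):
#   generate(spec, feedback) -> candidate program text
#   verify(program)          -> (ok, error_message)
Generate = Callable[[str, Optional[str]], str]
Verify = Callable[[str], Tuple[bool, str]]


def repair_loop(
    spec: str,
    generate: Generate,
    verify: Verify,
    max_rounds: int = 4,
) -> Tuple[bool, int]:
    """Try to produce a verified program, returning (verified, rounds_used)."""
    feedback: Optional[str] = None
    for round_index in range(1, max_rounds + 1):
        candidate = generate(spec, feedback)
        ok, error = verify(candidate)
        if ok:
            return True, round_index
        # Pass the verifier's message to the next attempt.
        feedback = error
    return False, max_rounds


def pass_rate(results: Sequence[bool]) -> float:
    """Fraction of tasks that verified; 0.0 for an empty result list."""
    return sum(results) / len(results) if results else 0.0
```

Under these assumptions, a model that benefits from repair would show `pass_rate` rising as `max_rounds` grows, while a model that saturates early would show roughly the same rate at every budget.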
The research indicates that language-specific barriers, namely the syntactic and semantic complexity of Verus and Lean, trap models in unproductive error loops. Developers building verified systems should expect substantially different AI assistance quality depending on their chosen verification framework, and verification language designers should treat AI integration as a first-class design criterion for practical adoption.
- Frontier AI models achieve 40.3% success in Dafny but collapse to 7.8% in Lean, revealing verification language design as a primary performance constraint
- Iterative repair mechanisms significantly boost performance in some models (tripling Dafny pass rates) but saturate early in others, indicating architectural differences in AI problem-solving approaches
- Verus and Lean's syntactic and semantic barriers create persistent error loops that prevent models from reaching logical correctness phases of verification tasks
- AlgoVeri's standardized evaluation framework enables direct comparison of AI capabilities across verification systems for the first time, addressing a critical gap in AI benchmarking methodology
- Verification system design significantly impacts practical AI-assisted code generation effectiveness, making it a critical consideration for both language designers and development teams