AI · Neutral · arXiv – CS AI · 15h ago · 6/10
Verus-SpecGym: An Agentic Environment for Evaluating Specification Autoformalization
Researchers introduce Verus-SpecGym, an evaluation environment that tests whether AI agents can automatically translate informal programming specifications into formal, machine-verifiable code. On the benchmark, frontier LLMs such as Gemini 3.1 Pro reach 77.8% accuracy on specification tasks, but the generated specs remain brittle and frequently miss edge cases, input constraints, and validation rules that human experts catch.
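To make "brittle specs that miss edge cases and input constraints" concrete, here is a minimal sketch in plain Rust rather than Verus syntax. The task ("return the maximum of a non-empty list") and the names `weak_spec` and `strong_spec` are hypothetical illustrations, not taken from the benchmark. Each function treats a specification as a predicate over an input and a candidate result.

```rust
/// Hypothetical weak postcondition: the result is an upper bound of every element.
/// It omits the non-empty input constraint and never requires the result
/// to be an element of the input.
pub fn weak_spec(input: &[i64], result: i64) -> bool {
    input.iter().all(|&x| x <= result)
}

/// Hypothetical stronger postcondition: the input is non-empty, the result
/// is an upper bound, and the result actually occurs in the input.
pub fn strong_spec(input: &[i64], result: i64) -> bool {
    !input.is_empty()
        && input.iter().all(|&x| x <= result)
        && input.contains(&result)
}
```

Under these assumptions, the weak predicate accepts a wrong answer such as `100` for `[1, 5, 3]` and is vacuously true on an empty list, while the stronger one rejects both. Gaps of this kind are what the summary describes human experts catching.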