AI · Neutral · arXiv, CS AI · 4h ago · 7/10
ProdCodeBench: A Production-Derived Benchmark for Evaluating AI Coding Agents
Researchers introduce ProdCodeBench, a new benchmark for evaluating AI coding agents based on real developer-agent sessions from production environments. The benchmark addresses limitations of existing coding benchmarks by using authentic prompts, code changes, and tests across seven programming languages, with foundation models achieving solve rates between 53.2% and 72.2%.