#code-evaluation News & Analysis

2 articles tagged with #code-evaluation. AI-curated summaries with sentiment analysis and key takeaways from 50+ sources.

2 articles

AINeutralarXiv – CS AI · Apr 67/10

🧠

ProdCodeBench: A Production-Derived Benchmark for Evaluating AI Coding Agents

Researchers introduce ProdCodeBench, a new benchmark for evaluating AI coding agents based on real developer-agent sessions from production environments. The benchmark addresses limitations of existing coding benchmarks by using authentic prompts, code changes, and tests across seven programming languages, with foundation models achieving solve rates between 53.2% and 72.2%.

AINeutralHugging Face Blog · Apr 165/107

🧠

Introducing the LiveCodeBench Leaderboard - Holistic and Contamination-Free Evaluation of Code LLMs

LiveCodeBench introduces a new leaderboard for evaluating code-focused Large Language Models (LLMs) with an emphasis on holistic assessment and contamination-free testing. The benchmark aims to provide more accurate and reliable evaluation of AI coding capabilities by addressing common issues in existing evaluation methods.