Beyond Pass Rate: A Multilingual, Execution-Grounded Evaluation of Open Code LLMs

arXiv – CS AI | Sayed Erfan Arefin
🤖 AI Summary

An evaluation of nine open-source coding LLMs on 2,707 LeetCode problems in 12 programming languages finds that the models fall well short of human developers. The best model reaches only 23.64% correctness against a 57.2% human baseline, a gap of 33.56 percentage points. Performance also varies substantially across languages and problem types, which indicates that aggregate benchmarks mask critical weaknesses in code generation systems.

Analysis

This large-scale empirical study addresses a fundamental limitation in how code-generation models are currently evaluated and ranked. Rather than relying on single-language benchmarks or aggregate pass rates that obscure real-world performance variation, the researchers ran execution-grounded tests across 325,343 problem-model-language combinations, recording compile errors, runtime failures, and static-analysis quality signals. The findings are sobering: open-source coding LLMs remain substantially below human programmer performance, and the gap widens on harder problems and in languages that receive less optimization attention.
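The bookkeeping behind this kind of execution-grounded testing can be pictured with a short sketch. It is a minimal illustration under stated assumptions, not the study's actual harness: the `RunRecord` fields, the `Outcome` categories, the `failure_shares` helper, and the toy records are all hypothetical.

```python
from collections import Counter
from dataclasses import dataclass
from enum import Enum
from typing import Dict, Iterable, Optional


class Outcome(Enum):
    # Hypothetical outcome categories, checked in this order.
    COMPILE_ERROR = "compile_error"
    RUNTIME_ERROR = "runtime_error"
    WRONG_ANSWER = "wrong_answer"
    PASSED = "passed"


@dataclass
class RunRecord:
    # Hypothetical record of one problem-model-language attempt.
    compiled: bool
    exit_code: Optional[int]  # None when the program never ran
    output_matches: bool


def classify(record: RunRecord) -> Outcome:
    """Assign the earliest failure stage, or PASSED if none failed."""
    if not record.compiled:
        return Outcome.COMPILE_ERROR
    if record.exit_code != 0:
        return Outcome.RUNTIME_ERROR
    if not record.output_matches:
        return Outcome.WRONG_ANSWER
    return Outcome.PASSED


def failure_shares(records: Iterable[RunRecord]) -> Dict[Outcome, float]:
    """Return each failure category as a fraction of all failures."""
    counts = Counter(classify(r) for r in records)
    failures = sum(n for o, n in counts.items() if o is not Outcome.PASSED)
    if failures == 0:
        return {}
    return {o: n / failures for o, n in counts.items() if o is not Outcome.PASSED}


# Toy data, invented for illustration only.
toy_records = [
    RunRecord(compiled=False, exit_code=None, output_matches=False),
    RunRecord(compiled=False, exit_code=None, output_matches=False),
    RunRecord(compiled=True, exit_code=1, output_matches=False),
    RunRecord(compiled=True, exit_code=0, output_matches=False),
    RunRecord(compiled=True, exit_code=0, output_matches=True),
]
toy_shares = failure_shares(toy_records)
```

Classifying by the earliest failing stage means each attempt lands in exactly one category, so the failure shares sum to one and a statistic such as "share of failures that are compile errors" is well defined.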

The research sits within a broader industry trend in which benchmark inflation creates false confidence in model capabilities. Previous evaluations using narrow datasets or a single programming language have produced inflated performance claims that do not translate to production use. This study shows that model rankings shift dramatically depending on the evaluation slice: Qwen2.5-Coder excels on hard problems while Gemma-2 leads on static code quality. Single-metric leaderboards hide these tradeoffs entirely.
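How a ranking can flip between slices is easy to see in a small sketch. The code below is an illustration only: the row schema (`model`, `difficulty`, `passed`), the helper names, and the toy results for `model_a` and `model_b` are hypothetical and do not come from the study.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def pass_rate_by_slice(results: List[dict], key: str) -> Dict[Tuple[str, str], float]:
    """Compute pass rate for every (model, slice value) pair."""
    totals = defaultdict(lambda: [0, 0])  # (model, slice) -> [passed, attempted]
    for row in results:
        cell = totals[(row["model"], row[key])]
        cell[0] += int(row["passed"])
        cell[1] += 1
    return {pair: passed / attempted for pair, (passed, attempted) in totals.items()}


def rank_models(rates: Dict[Tuple[str, str], float], slice_value: str) -> List[str]:
    """Order models from best to worst pass rate within one slice."""
    scored = [(model, rate) for (model, s), rate in rates.items() if s == slice_value]
    return [model for model, _ in sorted(scored, key=lambda t: t[1], reverse=True)]


# Toy results, invented for illustration only.
toy_results = [
    {"model": "model_a", "difficulty": "easy", "passed": True},
    {"model": "model_a", "difficulty": "easy", "passed": True},
    {"model": "model_a", "difficulty": "hard", "passed": False},
    {"model": "model_a", "difficulty": "hard", "passed": False},
    {"model": "model_b", "difficulty": "easy", "passed": True},
    {"model": "model_b", "difficulty": "easy", "passed": False},
    {"model": "model_b", "difficulty": "hard", "passed": True},
    {"model": "model_b", "difficulty": "hard", "passed": False},
]

toy_rates = pass_rate_by_slice(toy_results, key="difficulty")
easy_ranking = rank_models(toy_rates, "easy")  # ['model_a', 'model_b']
hard_ranking = rank_models(toy_rates, "hard")  # ['model_b', 'model_a']
```

In this toy data both models pass 4 of 8 and 2 of 8 attempts overall in different proportions per slice, so the order on easy problems reverses on hard ones. The same grouping works for any slice key, such as programming language.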

For developers and enterprises, the findings carry immediate implications. Because 63% of failures stem from compile errors, current models appear to struggle with fundamental syntax and language semantics, not just algorithmic reasoning. Code LLMs therefore need substantial refinement before they can be deployed in autonomous programming systems. The multilingual evaluation framework is itself valuable as AI systems increasingly serve global development communities with diverse language preferences.

Looking ahead, the research establishes a benchmarking methodology that future model developers should adopt. Success metrics will increasingly need to balance functional correctness, code quality, language coverage, and failure-mode transparency rather than pursue a single aggregate score.

Key Takeaways
  • The best open-source coding LLM achieves only 23.64% correctness versus a 57.2% human baseline, a significant gap for tools positioned as production-ready AI coding assistants
  • Model rankings shift substantially across programming languages and problem difficulty, so single-language benchmarks give misleading comparisons
  • Compile errors account for 63% of failures, suggesting current models struggle with basic syntax and language semantics before algorithmic correctness even comes into play
  • Static code quality metrics diverge significantly from functional correctness, so single-metric evaluations mask critical performance dimensions
  • Multilingual, execution-grounded evaluation reveals hidden tradeoffs that narrow benchmarks obscure, establishing a better standard for LLM assessment