AI · Bearish · arXiv – CS AI · 18h ago · 7/10
Beyond Pass Rate: A Multilingual, Execution-Grounded Evaluation of Open Code LLMs
An evaluation of 9 open-source coding LLMs on 2,707 LeetCode problems in 12 programming languages finds a large gap between the models and human developers: the best model reaches only 23.64% correctness, against a human baseline of 57.2%. Performance also varies substantially across languages and problem types, which suggests that aggregate benchmark scores mask important weaknesses in code generation systems.
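To make the point about aggregate scores concrete, here is a minimal sketch of how one overall pass rate can hide a large per-language spread. The records, language names, and the `pass_rates` helper are hypothetical and chosen only for illustration; none of them come from the paper or its benchmark.

```python
from collections import defaultdict

# Hypothetical toy results as (language, passed) pairs.
# These are illustrative values, NOT data from the paper.
results = [
    ("python", True), ("python", True), ("python", True), ("python", False),
    ("rust", False), ("rust", False), ("rust", False), ("rust", True),
]


def pass_rates(records):
    """Return (aggregate pass rate, per-language pass rates)."""
    totals = defaultdict(int)
    passed = defaultdict(int)
    for lang, ok in records:
        totals[lang] += 1
        passed[lang] += int(ok)
    per_language = {lang: passed[lang] / totals[lang] for lang in totals}
    aggregate = sum(passed.values()) / sum(totals.values())
    return aggregate, per_language


aggregate, per_language = pass_rates(results)
print(f"aggregate: {aggregate:.2f}")
for lang, rate in sorted(per_language.items()):
    print(f"{lang}: {rate:.2f}")
```

In this toy data the aggregate is 0.50 while the two languages sit at 0.75 and 0.25, so the single number describes neither of them well.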