AI · Neutral · arXiv – CS AI · 7h ago · 6/10
CodeGolf Bench: A Multi-Language Benchmark for Evaluating Concise Code Generation Capabilities of Large Language Models
Researchers introduce CodeGolf Bench, a new benchmark for evaluating the ability of large language models (LLMs) to generate concise code across 60 programming languages. The study finds that reasoning-capable models significantly outperform standard LLMs, reaching a 70.97% average percentile on code golf tasks and excelling in particular in languages with strict syntax requirements.