🧠 AI | ⚪ Neutral | Importance 6/10

CodegenBench: Can LLMs Write Efficient Code Across Architectures?

arXiv – CS AI | Jie Li, Wenzhao Wu, Junqi Hu, Qinrui Zheng, Bowen Wu, Juepeng Zheng, Yutong Lu, Haohuan Fu | June 4, 2026 at 04:00 AM

🤖 AI Summary

Researchers introduced CodegenBench, a benchmark suite evaluating large language models' ability to generate efficient code across diverse CPU architectures including x86_64, Sunway, and Kunpeng. The study reveals that while LLMs excel at generating optimized code for mainstream architectures, they significantly underperform on domain-specific platforms with limited public documentation, exposing critical gaps in cross-platform generalization.

Analysis

CodegenBench addresses a gap in LLM evaluation by testing code generation beyond GPU-accelerated environments and general-purpose computing. Its focus on high-performance computing across heterogeneous architectures reflects the growing need for efficient code on the specialized hardware used in supercomputing and enterprise infrastructure.

The research builds on prior work evaluating LLM code generation but extends it to less-studied hardware. Previous benchmarks emphasized PyTorch, CUDA, and mainstream platforms for which substantial training data exists. By including lesser-known architectures such as Sunway and Kunpeng, which are commonly used in Chinese supercomputing facilities, the study reveals systemic limits in how LLMs generalize to unfamiliar hardware ecosystems.

The findings have practical implications for organizations that use LLMs for code optimization. Development teams cannot yet rely on LLMs to generate code for specialized architectures and must continue to depend on domain experts for performance-critical systems. This limits the productivity gains that generative AI promises for software development.

The open-source release of CodegenBench and its evaluation infrastructure provides a foundation for targeted improvements. Future research will likely focus on improving LLM performance on architectures with limited training data, potentially through transfer learning or architecture-agnostic optimization strategies. Organizations developing proprietary hardware platforms may also share more training data to improve LLM support for their ecosystems, creating competitive pressure in the AI tools market.

Key Takeaways

→ LLMs generate efficient code for mainstream architectures like x86_64, but their performance degrades significantly on domain-specific platforms with limited documentation
→ The benchmark suite comprises 106 BLAS routines plus 40 specialized kernels across three distinct hardware architectures
→ Current LLMs perform best on moderately difficult problems that require concise implementations
→ Cross-platform code generalization remains a critical unresolved challenge in LLM-driven development
→ The open-source benchmark release enables systematic research into improving LLM performance on specialized hardware