Multi-LCB: Extending LiveCodeBench to Multiple Programming Languages
Researchers introduce Multi-LCB, an extension of the LiveCodeBench evaluation framework that tests large language models across twelve programming languages instead of just Python. The benchmark reveals significant performance disparities across languages and evidence of Python overfitting in current LLMs, establishing a more rigorous standard for assessing real-world multilingual code generation capabilities.
Multi-LCB addresses a critical gap in LLM evaluation methodology by expanding beyond Python-only benchmarking. LiveCodeBench had become the standard for assessing code generation in LLMs through competitive programming problems with contamination controls, but its single-language focus obscured whether models could genuinely generalize across the polyglot requirements of production software engineering. The new benchmark transforms existing Python tasks into equivalent problems in other languages, such as Java, C++, and JavaScript, maintaining the original framework's integrity while enabling systematic cross-language assessment.
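One way to picture how a single task can be checked in many languages is a harness that only looks at standard input and output. The sketch below assumes tasks are judged that way; the article does not describe Multi-LCB's actual harness, and `run_case` and its parameters are hypothetical names chosen for illustration.

```python
import subprocess
import sys


def run_case(cmd, stdin_text, expected_stdout, timeout_s=5.0):
    """Return True if `cmd` prints `expected_stdout` when fed `stdin_text`.

    Hypothetical helper: `cmd` is whatever launches the candidate solution
    (for example ["java", "Main"] or ["node", "main.js"]), so the check
    itself does not depend on the language the solution is written in.
    """
    try:
        proc = subprocess.run(
            cmd,
            input=stdin_text,
            capture_output=True,
            text=True,
            timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        # A solution that runs too long counts as a failure.
        return False
    if proc.returncode != 0:
        # Crashes and runtime errors count as failures.
        return False
    # Compare output while ignoring leading/trailing whitespace.
    return proc.stdout.strip() == expected_stdout.strip()


# Demo with a Python one-liner standing in for a solution in any language.
solution = "print(sum(int(x) for x in input().split()))"
ok = run_case([sys.executable, "-c", solution], "2 3\n", "5")
```

Because only the launch command changes from one language to the next, the same inputs and expected outputs can be reused for every language under this assumption.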
The research reveals troubling patterns in current LLM capabilities. The evaluation of 24 models uncovered substantial performance degradation when moving beyond Python, suggesting that models are overfit to Python training data rather than developing a transferable understanding of code generation. Language-specific contamination, where models may have seen training examples in particular languages, further complicates the picture and raises questions about data preparation in LLM training pipelines.
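The kind of comparison described above can be sketched as a per-language pass rate measured against a Python baseline. This is a minimal illustration, not the study's scoring code: the function names are hypothetical, and the outcomes in `toy_results` are made up purely to show the arithmetic.

```python
from collections import defaultdict


def pass_rate_by_language(results):
    """Compute the fraction of passed attempts per language.

    `results` is an iterable of (language, passed) pairs, one per attempt.
    """
    totals = defaultdict(int)
    passes = defaultdict(int)
    for language, passed in results:
        totals[language] += 1
        passes[language] += int(passed)
    return {lang: passes[lang] / totals[lang] for lang in totals}


def gap_vs_baseline(rates, baseline="python"):
    """Return each language's pass rate minus the baseline's pass rate."""
    base = rates[baseline]
    return {lang: rate - base for lang, rate in rates.items() if lang != baseline}


# Toy, invented outcomes (NOT results from the benchmark).
toy_results = [
    ("python", True), ("python", True), ("python", False), ("python", True),
    ("java", True), ("java", False), ("java", False), ("java", True),
]

rates = pass_rate_by_language(toy_results)
gaps = gap_vs_baseline(rates)
```

A negative gap means the language scores below Python on the same set of tasks, which is the pattern the evaluation reports for current models.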
For the AI development community, this work highlights a fundamental disconnect between benchmark performance and real-world applicability. Software engineers require models that perform consistently across diverse languages, yet existing evaluations may have masked significant capability gaps. This creates both a challenge and an opportunity: developers must confront the limitations of current models, while researchers gain clearer benchmarks for improvement. Because Multi-LCB is compatible with future LiveCodeBench updates, it can keep pressure-testing models as new problems are added, avoiding the static-evaluation problem that plagues aging benchmarks and establishing multilingual code generation as a non-negotiable requirement for production-ready models.
- Multi-LCB expands code evaluation to twelve languages, revealing significant performance disparities and Python overfitting in current LLMs.
- The benchmark maintains contamination controls and integrates automatically with future LiveCodeBench updates for ongoing assessment.
- Evaluation of 24 models exposed language-specific contamination issues and substantial gaps in multilingual code generation capabilities.
- Results suggest current LLMs may not generalize effectively across diverse programming languages required in real-world software engineering.
- Multi-LCB establishes a new standard for evaluating cross-language code generation competence beyond Python-focused benchmarks.