A Study of LLMs' Preferences for Libraries and Programming Languages
A new empirical study finds that eight major LLMs exhibit systematic biases in code generation: they overuse popular libraries such as NumPy in 45% of cases, and default to Python even when it is unsuitable for the task, prioritizing familiarity over task-specific optimality. The findings highlight gaps in current LLM evaluation methodologies and underscore the need for targeted improvements in training data diversity and benchmarking standards.
This research addresses a critical blind spot in LLM evaluation frameworks that have historically focused on functional correctness while ignoring architectural and design-choice quality. The study's findings reveal a fundamental limitation in how current LLMs learn from training data: they internalize popularity signals as proxies for correctness, rather than developing contextual reasoning about when specific tools are appropriate. This distinction matters because while a NumPy solution may be functionally correct, it can introduce unnecessary dependencies, performance overhead, or maintenance complexity.
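The dependency cost is easy to see in miniature. The sketch below is a hypothetical illustration (not drawn from the study's data): two functions that compute the same mean, where one reaches for NumPy and thereby adds an external dependency, an install footprint, and import overhead that the task never required.

```python
# Hypothetical illustration: both functions compute the mean of a small
# list, but only one pulls in a third-party dependency to do it.

import statistics  # standard library: no extra install required


def mean_stdlib(values):
    """Mean via the standard library -- zero third-party dependencies."""
    return statistics.mean(values)


def mean_numpy(values):
    """Functionally equivalent, but adds NumPy as a dependency for one call."""
    import numpy as np  # sizeable install and import cost for this tiny task
    return float(np.mean(values))


print(mean_stdlib([1, 2, 3, 4]))  # 2.5
```

Both return the same value; the difference only shows up in the dependency graph, install size, and long-term maintenance burden — exactly the axes that correctness-only benchmarks do not measure.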
The preference for Python across diverse task contexts reflects training data skew: Python dominates public code repositories and educational materials. This creates a compounding problem — as more Python-centric generated code is added to future training datasets, the bias intensifies. The finding that Rust, despite being better suited to high-performance scenarios, never appears in initialization tasks demonstrates how popularity-based learning diverges from technical optimality.
For software development teams relying on LLM code generation, this research validates concerns about code quality beyond correctness metrics. Developers may receive functionally valid but suboptimal solutions that incur technical debt. The implications extend to DevOps, infrastructure planning, and resource allocation when LLM-generated code introduces unnecessary computational overhead through inappropriate library choices.
Future development requires three parallel efforts: improved training data curation with deliberate language and library diversity, fine-tuning strategies that reward architectural appropriateness, and new benchmarks explicitly measuring design-choice fidelity. Organizations evaluating LLM code generation tools should request visibility into these metrics rather than relying solely on standard correctness benchmarks.
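One way such a design-choice benchmark could work is to diff the imports of a generated solution against those of the ground-truth solution and flag anything extra. The sketch below is an illustrative assumption — the function names and the import-diff idea are mine, not the study's actual methodology:

```python
# Hypothetical sketch of a design-choice check: flag modules the model
# imported that the ground-truth solution did without. This is an
# illustrative metric, not the benchmark used in the study.

import ast


def imported_modules(source: str) -> set[str]:
    """Top-level module names imported anywhere in the given source."""
    modules = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            modules.update(alias.name.split(".")[0] for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.module:
            modules.add(node.module.split(".")[0])
    return modules


def unnecessary_imports(generated: str, reference: str) -> set[str]:
    """Modules the generated code pulls in that the reference never needed."""
    return imported_modules(generated) - imported_modules(reference)


generated = "import numpy as np\ndef mean(xs):\n    return float(np.mean(xs))\n"
reference = "def mean(xs):\n    return sum(xs) / len(xs)\n"
print(unnecessary_imports(generated, reference))  # {'numpy'}
```

A static check like this only catches surplus dependencies, not every design-quality dimension (language choice, performance, maintainability), but it shows how architectural appropriateness can be scored alongside functional correctness rather than ignored.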
- LLMs overuse popular libraries like NumPy in 45% of cases despite ground-truth solutions not requiring them.
- Python dominates LLM code generation even in contexts where it is technically suboptimal for the task.
- Current LLM evaluation frameworks measure only functional correctness, not design choices.
- Training data skew toward Python and popular libraries perpetuates bias in model outputs.
- Targeted fine-tuning and diverse training data are essential to improve language and library selection quality.