Correct Code, Vulnerable Dependencies: A Large Scale Measurement Study of LLM-Specified Library Versions
A comprehensive measurement study reveals that large language models frequently specify vulnerable and incompatible library versions in generated Python code, with 36.70%-55.70% of tasks containing known CVEs and 62.75%-74.51% rated as Critical or High severity. The research demonstrates this represents a systemic bias across all evaluated models rather than isolated errors, with most CVEs publicly disclosed before the models' knowledge cutoffs.
This study exposes a critical vulnerability in the AI-assisted development workflow that has received minimal attention despite widespread LLM adoption in software engineering. When developers rely on LLM-generated code, they inherit version dependencies that frequently contain known security flaws, creating a supply-chain risk at the dependency level rather than the code logic level. The research evaluated ten major LLMs against a benchmark of 1,000 Stack Overflow tasks, finding that between 36.70% and 55.70% of generated code snippets included at least one known CVE, with 62.75%-74.51% of those rated Critical or High severity.
The findings demonstrate that this is not random error but a systemic bias rooted in training data distribution. All tested models converge on the same small set of risky release versions, suggesting they learned problematic patterns from common code repositories and documentation. Notably, 72.27%-91.37% of the CVEs involved were publicly disclosed before each model's training cutoff, meaning the vulnerability data existed during model development. Compatibility failures compound the security issue, with static analysis showing compatibility rates of only 19.70%-63.20%, depending on the model.
For the developer and enterprise ecosystems, this creates an immediate security debt problem. Developers using LLM code generation without explicit version pinning practices risk deploying vulnerable dependencies at scale. The research demonstrates that externally anchored version constraints substantially reduce both vulnerability exposure and compatibility failures, providing a practical mitigation path. This work establishes version selection as a previously overlooked risk surface in LLM-based development, demanding new tooling and workflow integration to validate LLM-generated dependency specifications before deployment.
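One practical first line of defense is to verify that LLM-generated dependency lists are explicitly pinned before they enter a build. The sketch below is a minimal, hypothetical illustration (the function name and regex are assumptions, not from the study): it scans a requirements.txt-style string and flags any package that is not locked to an exact `==` version, so floating specifiers can be routed to a full vulnerability scanner for review.

```python
import re

# Matches a line pinned to an exact version, e.g. "requests==2.31.0".
PIN_RE = re.compile(r"^\s*[A-Za-z0-9_.\-]+\s*==\s*[\w.]+")

def unpinned_requirements(requirements_text: str) -> list[str]:
    """Return package names from requirement lines lacking an exact '==' pin."""
    offenders = []
    for line in requirements_text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and blank lines
        if not line:
            continue
        if not PIN_RE.match(line):
            # Strip any range specifier to recover the bare package name.
            name = re.split(r"[<>=;!~\s]", line, 1)[0]
            offenders.append(name)
    return offenders

reqs = """\
requests==2.31.0
numpy>=1.21   # floating lower bound: may resolve to a vulnerable release
pillow
"""
print(unpinned_requirements(reqs))  # → ['numpy', 'pillow']
```

A check like this does not judge whether a pinned version is safe; it only guarantees that whatever version is deployed is the one that was actually audited.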
- 36.70%-55.70% of LLM-generated code tasks contain at least one known CVE, with the majority rated Critical or High severity
- All tested LLMs exhibit systemic bias toward the same risky library versions rather than random errors
- 72.27%-91.37% of the CVEs involved were publicly disclosed before model training cutoffs, indicating preventable security debt
- Static compatibility rates range from 19.70% to 63.20%, with version selection (not code quality) as the primary failure cause
- Externally anchored version constraints substantially reduce both vulnerability exposure and compatibility failures
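The mitigation in the last bullet, externally anchoring version constraints, can be sketched as follows. This is an illustrative assumption about how such anchoring might work, not the study's implementation: LLM-chosen versions are overridden by versions from a trusted, externally maintained source (a curated lockfile or internal registry), and packages the anchor does not know about pass through unchanged for manual review. The package names and version strings are placeholders.

```python
# Hypothetical trusted anchor, e.g. loaded from a vetted lockfile.
TRUSTED_VERSIONS = {
    "requests": "2.32.3",
    "numpy": "1.26.4",
}

def anchor(llm_spec: dict[str, str]) -> dict[str, str]:
    """Replace LLM-chosen versions with externally anchored ones where available.

    Packages absent from the trusted anchor keep the LLM's suggestion so a
    human or scanner can review them, rather than being silently dropped.
    """
    return {pkg: TRUSTED_VERSIONS.get(pkg, ver) for pkg, ver in llm_spec.items()}

llm_output = {"requests": "2.19.1", "numpy": "1.21.0", "somepkg": "0.1"}
print(anchor(llm_output))
# → {'requests': '2.32.3', 'numpy': '1.26.4', 'somepkg': '0.1'}
```

The key design point is that the version decision is moved out of the model entirely: the LLM proposes the dependency set, but an external, auditable source decides the versions.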