LibEvoBench: Probing Temporal Knowledge Stratification in Code Generation Models
Researchers introduce LibEvoBench, a benchmark testing how well AI code generation models handle multiple versions of Python libraries. The study reveals that state-of-the-art LLMs struggle with version-specific API knowledge, making anachronistic errors when libraries evolve, though documentation significantly improves performance.
The research addresses a fundamental gap in how large language models handle real-world software development. Modern codebases frequently maintain dependencies on older library versions due to compatibility requirements and migration costs, yet current LLMs trained on temporally mixed data lack mechanisms to distinguish between API versions. In practice, this means models generate code that uses function signatures which are obsolete in the target version or which were only introduced in a later one.
LibEvoBench fills an important evaluation gap by systematically measuring how models perform across library versions rather than assuming single-point performance metrics apply universally. The Software Evolution Understanding Score (SEUS) metric tracks consistency degradation, revealing that models remain "version-oblivious" regardless of their overall sophistication. This finding exposes limitations in current training paradigms that blend temporal information without explicit version awareness.
The implications extend beyond academic research. Developers using AI-assisted coding tools may receive suggestions incompatible with their project's library versions, creating technical debt and security risks. Teams cannot rely on simply specifying target versions in the prompt as a workaround, because doing so yields no meaningful improvement in accuracy. However, the positive response to documentation suggests that practical improvements are achievable without architectural changes, for example through better training data curation.
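To make the contrast between version specification and documentation concrete, the sketch below assembles a prompt that carries version-matched documentation rather than a bare version string. It is only an illustration under stated assumptions: `examplelib`, its function signatures, the `DOCS` store, and the prompt layout are all hypothetical and are not taken from LibEvoBench.

```python
# Hypothetical documentation store keyed by (library, version, api name).
# Neither "examplelib" nor these signatures refer to a real package.
DOCS = {
    ("examplelib", "1.4", "load_table"): "load_table(path, sep=',') -> Table",
    ("examplelib", "2.0", "read_table"): "read_table(path, *, delimiter=',') -> Table",
}


def build_prompt(task, library, version, docs=DOCS):
    """Prepend the documentation entries that match the target library version."""
    matching = [
        f"- {api}: {text}"
        for (lib, ver, api), text in sorted(docs.items())
        if lib == library and ver == version
    ]
    header = f"Target: {library}=={version}"
    doc_block = "\n".join(matching) if matching else "(no documentation found)"
    return f"{header}\nRelevant documentation:\n{doc_block}\n\nTask: {task}"


if __name__ == "__main__":
    print(build_prompt("Load data.csv into a table.", "examplelib", "1.4"))
```

The only design choice shown here is that documentation is filtered by the pinned version before it reaches the model, so the prompt never mixes signatures from different releases.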
This research highlights why AI code generation tools require domain-specific safety mechanisms beyond general language understanding. Organizations deploying these models in production should implement validation layers that check API compatibility, while model developers should prioritize version-aware training strategies. Because supplying documentation helps while specifying a version does not, future improvements may focus on retrieval-augmented generation and context-aware fine-tuning approaches.
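One possible shape for such a validation layer is a static check that scans generated code for calls known to be removed in the project's pinned version. The following is a minimal sketch, not a method described by the researchers: the `REMOVED_IN` table, the `examplelib` names, and the version numbers are hypothetical, and a real table would have to be built from each library's changelog.

```python
import ast

# Hypothetical table: dotted API name -> first version in which it was removed.
REMOVED_IN = {
    "examplelib.load_table": (2, 0, 0),
    "examplelib.utils.to_scalar": (1, 5, 0),
}


def parse_version(text):
    """Turn '1.4' into (1, 4, 0); assumes purely numeric components."""
    parts = [int(part) for part in text.split(".")]
    parts += [0] * (3 - len(parts))
    return tuple(parts)


def dotted_name(node):
    """Return 'a.b.c' for a Name/Attribute chain, or None for anything else."""
    parts = []
    while isinstance(node, ast.Attribute):
        parts.append(node.attr)
        node = node.value
    if isinstance(node, ast.Name):
        parts.append(node.id)
        return ".".join(reversed(parts))
    return None


def find_incompatible_calls(source, target_version, removed_in=REMOVED_IN):
    """List (line, api_name) pairs for calls already removed in target_version."""
    target = parse_version(target_version)
    problems = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Call):
            name = dotted_name(node.func)
            if name in removed_in and target >= removed_in[name]:
                problems.append((node.lineno, name))
    return sorted(problems)


if __name__ == "__main__":
    generated = "import examplelib\nt = examplelib.load_table('data.csv')\n"
    print(find_incompatible_calls(generated, "2.1.0"))
```

The check only parses the code and never imports or runs it. It matches fully qualified names as written, so it will miss calls made through import aliases such as `import examplelib as el`; a production check would need to resolve those first.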
- State-of-the-art LLMs perform inconsistently across library versions despite being trained on mixed temporal data
- Simply specifying target library versions in prompts provides no meaningful improvement to model accuracy
- Providing relevant documentation significantly boosts model performance on version-specific code generation tasks
- Current training paradigms lack explicit mechanisms for temporal and version-specific knowledge stratification
- The findings motivate new approaches to training AI models with temporally grounded and version-aware knowledge