Code-QA-Bench: Separating Code Reasoning from Documentation Memorization in Repository-Level QA
Researchers introduce Code-QA-Bench, an automated framework that generates repository-level code understanding benchmarks while distinguishing genuine code comprehension from documentation recall. Testing four frontier AI models reveals that code access is the primary driver of performance, while documentation provides marginal benefits, suggesting current models excel at code reasoning when source material is available.
Code-QA-Bench addresses a critical evaluation gap in large language model benchmarking by separating actual code understanding from memorization effects. The framework's answer-first methodology—where agents explore code before deriving questions—ensures tasks reflect genuine code structure rather than documentation patterns. This approach matters because previous benchmarks risk measuring what models have memorized from training data rather than what they can genuinely reason about.
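To make the answer-first idea concrete, here is a minimal sketch in Python. It is not the framework's actual pipeline: the article describes agents exploring code, and this sketch swaps in a simple static walk with the standard-library `ast` module. The sample source, the function names `extract_default_facts` and `question_from_fact`, and the question template are all hypothetical. The point is only the ordering: a fact is read out of the code first, and the question is derived from that fact afterwards.

```python
import ast

# Hypothetical snippet standing in for a repository file.
SOURCE = '''
def retry(func, attempts=3, delay=0.5):
    """Call func until it succeeds."""
    return func()
'''


def extract_default_facts(source):
    """Collect (function, parameter, default) facts directly from the code."""
    facts = []
    tree = ast.parse(source)
    for node in ast.walk(tree):
        if isinstance(node, ast.FunctionDef):
            args = node.args.args
            defaults = node.args.defaults
            # Defaults align with the last len(defaults) positional parameters.
            for arg, default in zip(args[len(args) - len(defaults):], defaults):
                if isinstance(default, ast.Constant):
                    facts.append((node.name, arg.arg, default.value))
    return facts


def question_from_fact(fact):
    """Derive a question from an answer that is already grounded in the code."""
    func, param, value = fact
    return {
        "question": f"What is the default value of `{param}` in `{func}`?",
        "answer": repr(value),
    }


# Answer first (facts), question second.
qa_pairs = [question_from_fact(f) for f in extract_default_facts(SOURCE)]
for pair in qa_pairs:
    print(pair["question"], "->", pair["answer"])
```

Because every answer is extracted before its question is written, no question can refer to something that exists only in the documentation.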
The three-condition experimental design (closed-book, code-only, and code plus documentation) reveals important insights about model capabilities. The +0.23 mean performance gain from code access over the closed-book condition shows that contemporary AI models depend heavily on seeing the actual source code rather than on knowledge recalled from training. Documentation's modest +0.071 boost on doc-dependent tasks suggests that while supplementary materials help, they are not the primary factor driving code comprehension performance. The near-equivalence between the code-only and fully documented settings supports the framework's claim that it isolates code reasoning from documentation utility.
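The arithmetic behind such comparisons can be sketched in a few lines of Python. Everything in the sketch is an assumption made for illustration: the condition keys (`closed_book`, `code_only`, `code_plus_docs`), the `doc_dependent` flag, and especially the scores, which are invented and are not the benchmark's results. It shows one plausible way to compute a mean gain from code access across all tasks and a documentation gain restricted to doc-dependent tasks.

```python
from statistics import mean

# Made-up per-task scores in [0, 1]; NOT results from Code-QA-Bench.
tasks = [
    {"id": "t1", "doc_dependent": False,
     "closed_book": 0.50, "code_only": 0.75, "code_plus_docs": 0.75},
    {"id": "t2", "doc_dependent": True,
     "closed_book": 0.25, "code_only": 0.50, "code_plus_docs": 0.75},
    {"id": "t3", "doc_dependent": True,
     "closed_book": 0.50, "code_only": 0.75, "code_plus_docs": 0.75},
    {"id": "t4", "doc_dependent": False,
     "closed_book": 0.25, "code_only": 0.50, "code_plus_docs": 0.50},
]


def mean_gain(rows, baseline, treatment):
    """Average per-task score difference between two conditions."""
    return mean(row[treatment] - row[baseline] for row in rows)


# Gain from seeing the code, averaged over all tasks.
code_gain = mean_gain(tasks, "closed_book", "code_only")

# Gain from adding documentation, measured only on doc-dependent tasks.
doc_tasks = [t for t in tasks if t["doc_dependent"]]
doc_gain = mean_gain(doc_tasks, "code_only", "code_plus_docs")

print(f"code access gain: {code_gain:+.3f}")
print(f"documentation gain (doc-dependent tasks): {doc_gain:+.3f}")
```

Restricting the documentation gain to doc-dependent tasks matters: averaged over all tasks, any effect of documentation would be diluted by questions that never needed it.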
For the AI development community, these findings have practical implications for model training and evaluation strategies. Developers can now construct more rigorous benchmarks that prevent misleading performance claims based on memorization rather than reasoning ability. The open-source framework's applicability to any well-documented Python repository creates opportunities for standardized, reproducible evaluation across diverse codebases. As AI models increasingly target software engineering tasks, establishing clear measurement boundaries between documentation recall and genuine code comprehension becomes essential for understanding model limitations and guiding improvement efforts.
- Code access is the dominant performance factor, yielding a +0.23 mean improvement over closed-book conditions
- Documentation provides a modest supplementary benefit (+0.071) on documentation-dependent tasks
- Code-only performance nearly matches fully documented conditions, supporting the separation methodology
- Answer-first generation grounds every task in real code structure rather than in documentation patterns
- The framework is open-source and applicable to any well-documented Python repository for standardized evaluation