RepoMirage: Probing Repository Context Reasoning in Code Agents with Perturbations
Researchers introduce RepoMirage, an evaluation suite that tests whether code agents truly understand repository context by applying perturbations to challenge their reasoning abilities. The study reveals a significant gap in how agents handle complex, multi-file code tasks, with performance dropping from 66.8% to 25.3% when explicit structural understanding is required.
RepoMirage addresses a blind spot in the evaluation of AI code agents. While models like Claude and GPT-4 perform strongly on standard benchmarks such as SWE-Bench, the research asks whether that success reflects genuine repository reasoning or the exploitation of superficial task patterns. The evaluation runs in two stages: initial perturbations expose context sensitivity, and extended tasks then isolate gaps in structural understanding.
The performance collapse from 66.8% to 25.3% is striking and points to a fundamental limitation in how agents use context. Code agents can access broad repository context but fail to synthesize it into an actionable model of the repository's structure. This pattern of exploration drift suggests that agents retrieve files without building a coherent map of the codebase architecture, much like reading documents without understanding how they relate to one another.
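To put the drop in perspective, the two reported scores can be compared in absolute and relative terms. The short Python sketch below uses only the 66.8% and 25.3% figures quoted above; the function name `score_drop` is illustrative and does not come from the study.

```python
def score_drop(before: float, after: float) -> tuple:
    """Return (absolute drop in percentage points, relative drop in percent)."""
    absolute = before - after
    relative = absolute / before * 100
    return absolute, relative


# Scores reported in the summary above.
absolute, relative = score_drop(66.8, 25.3)
print(f"absolute drop: {absolute:.1f} percentage points")  # 41.5
print(f"relative drop: {relative:.1f}%")                   # 62.1
```

The distinction matters when reading the headline numbers: the agents lose 41.5 percentage points, which is about 62% of their original score.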
The proposed RepoAnchor workflow, which separates exploration from problem-solving, mirrors human developer practice, where understanding the architecture precedes implementation. This structure-first approach achieved notable gains, suggesting that the path forward involves explicit scaffolding rather than black-box scaling.
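The idea of separating exploration from problem-solving can be illustrated with a toy example. The sketch below is not RepoAnchor's implementation, which this summary does not describe; it assumes a simplified setting in which the "structure" is just an import graph over an in-memory repository. The names `build_import_map`, `files_to_read`, and the `repo` dictionary are all hypothetical.

```python
import ast


def build_import_map(sources):
    """Phase 1 (exploration): record which in-repo modules each module imports."""
    import_map = {}
    for name, code in sources.items():
        deps = set()
        for node in ast.walk(ast.parse(code)):
            if isinstance(node, ast.Import):
                deps.update(alias.name for alias in node.names)
            elif isinstance(node, ast.ImportFrom) and node.module:
                deps.add(node.module)
        # Keep only modules that live in this repository.
        import_map[name] = deps & set(sources)
    return import_map


def files_to_read(import_map, target):
    """Phase 2 (problem-solving): choose context from the map, not by ad hoc search.

    Returns the target, everything it transitively imports, and its direct importers.
    """
    seen, stack = set(), [target]
    while stack:
        module = stack.pop()
        if module not in seen:
            seen.add(module)
            stack.extend(import_map.get(module, ()))
    importers = {m for m, deps in import_map.items() if target in deps}
    return sorted(seen | importers)


# Hypothetical four-module repository, keyed by module name.
repo = {
    "app": "import services\n",
    "services": "from models import User\nimport json\n",
    "models": "class User: ...\n",
    "cli": "import app\n",
}

import_map = build_import_map(repo)
context = files_to_read(import_map, "services")
print(context)  # ['app', 'models', 'services']
```

The structural map is built once, before any task is attempted, and the second phase consults it to decide what to read. A real agent workflow would need a far richer notion of structure than imports, so this should be read only as an illustration of the two-phase split.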
These findings matter for the AI development community. As code generation moves toward autonomous repository-level tasks, understanding these reasoning gaps becomes critical. The work suggests that increasing model size or context window length alone cannot overcome limitations in structural comprehension. Future systems will likely require deliberate design changes that prioritize semantic understanding of codebase topology, not just file retrieval.
- Code agents' performance falls from 66.8% to 25.3%, a relative drop of about 62%, when repository context perturbations increase reasoning demands, suggesting superficial task understanding
- Agents retrieve relevant files but fail to build coherent structural models, exhibiting exploration drift without effective synthesis
- Structure-first scaffolding that separates exploration from problem-solving yields measurable improvements over end-to-end approaches
- Current benchmarks may overestimate agent capabilities because they do not adequately test multi-file reasoning and architectural understanding
- Repository context reasoning requires explicit structural awareness, a problem that scaling alone does not appear to solve