From Fragments to Paths: Task-Level Context Recovery for Large Industrial Codebases
Researchers introduce DeepDiscovery, an AI method that helps large language models understand complex industrial codebases by recovering task-relevant context across multi-relational repository structures. It achieves a 78.6% solve rate on SWE-bench Verified and raises full recall rate by 1.6-9.2 percentage points on production-scale repositories.
DeepDiscovery addresses a critical limitation in current AI-assisted software engineering: while large language models excel at isolated coding tasks, they struggle with the contextual understanding required for complex repository-level work. The research finds that existing retrieval methods often capture only local code fragments, missing the relationships between them and the broader context needed for sound engineering decisions. DeepDiscovery's two-stage Location-Inference framework first localizes high-confidence task anchors, then expands from those anchors to recover relevant context, all within a practical computational budget.
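To make the anchor-then-expand idea concrete, here is a minimal sketch under stated assumptions. It is not DeepDiscovery's implementation: the toy graph, the relation names, the token-overlap scoring, and the functions `localize_anchors` and `expand_context` are all hypothetical, chosen only to illustrate localizing anchors first and then following repository relations under a fixed budget.

```python
from collections import deque

# Hypothetical toy repository graph: node -> list of (relation, neighbor).
# The relation labels stand in for the "multi-relational" structure.
REPO_GRAPH = {
    "billing/invoice.py::total": [
        ("calls", "billing/tax.py::rate"),
        ("imports", "billing/models.py::Invoice"),
    ],
    "billing/tax.py::rate": [("reads_config", "config/tax.yaml")],
    "billing/models.py::Invoice": [("inherits", "core/models.py::Base")],
    "config/tax.yaml": [],
    "core/models.py::Base": [],
    "ui/theme.py::colors": [],
}


def localize_anchors(task, graph, min_score=0.75):
    """Stage 1 (assumed scoring): keep nodes whose name covers most task tokens."""
    task_tokens = set(task.lower().split())
    anchors = []
    for node in graph:
        flat = node.lower().replace("::", "/").replace(".", "/")
        node_tokens = set(flat.split("/"))
        score = len(task_tokens & node_tokens) / len(task_tokens)
        if score >= min_score:
            anchors.append(node)
    return anchors


def expand_context(anchors, graph, budget=4):
    """Stage 2 (assumed strategy): breadth-first expansion, capped at `budget` nodes."""
    context = list(anchors)[:budget]
    seen = set(context)
    queue = deque(context)
    while queue and len(context) < budget:
        node = queue.popleft()
        for _relation, neighbor in graph.get(node, []):
            if neighbor not in seen and len(context) < budget:
                seen.add(neighbor)
                context.append(neighbor)
                queue.append(neighbor)
    return context


anchors = localize_anchors("invoice total", REPO_GRAPH)
context = expand_context(anchors, REPO_GRAPH, budget=4)
print(anchors)
print(context)
```

In this toy run, the unrelated `ui/theme.py::colors` node is never reached because expansion only follows edges out of the anchor, and the budget stops the walk before it reaches `core/models.py::Base`. A real system would presumably use a far richer scoring and expansion policy than token overlap and breadth-first search.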
The reported metrics point to real-world impact. On production-scale codebases from an organization-internal ecosystem, DeepDiscovery improved full recall rates across multiple AI coding systems by 1.6 to 9.2 percentage points. Its 78.6% solve rate on SWE-bench Verified is an 8.2 percentage point improvement over baseline approaches, suggesting that better repository understanding translates into more effective AI coding agents.
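The summary does not define "full recall rate", so the sketch below assumes one common reading: a task counts as a hit only when every required context item was retrieved. The function name `full_recall_rate` and the toy data are hypothetical and do not reproduce any reported result; the example also shows why such gains are quoted in percentage points (an absolute difference between two rates) and not as a relative percentage.

```python
def full_recall_rate(retrieved_per_task, required_per_task):
    """Fraction of tasks whose required items were ALL retrieved (assumed definition)."""
    if not required_per_task:
        raise ValueError("at least one task is required")
    hits = sum(
        1
        for retrieved, required in zip(retrieved_per_task, required_per_task)
        if set(required) <= set(retrieved)
    )
    return hits / len(required_per_task)


# Toy data: four tasks, each with the files a correct fix would need.
required = [{"a.py", "b.py"}, {"c.py"}, {"d.py", "e.py"}, {"f.py"}]
baseline = [{"a.py"}, {"c.py"}, {"d.py"}, {"f.py", "g.py"}]
improved = [{"a.py", "b.py"}, {"c.py"}, {"d.py"}, {"f.py"}]

baseline_rate = full_recall_rate(baseline, required)
improved_rate = full_recall_rate(improved, required)
gain_in_points = (improved_rate - baseline_rate) * 100

print(baseline_rate, improved_rate, gain_in_points)
```

Under this definition, partially recovering a task's context earns no credit, which makes the metric stricter than average per-item recall.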
This advancement has immediate implications for enterprise software development and AI coding assistants. As companies deploy AI tools for code generation and modification, understanding complex industrial repositories becomes increasingly valuable. Better context recovery enables more accurate code suggestions, fewer hallucinations, and higher-quality automated engineering solutions. The method's effectiveness without offline preprocessing makes it practical for real-world deployment.
Looking forward, this work establishes repository understanding as a key differentiator for AI coding platforms. Future developments might focus on scaling these techniques to even larger codebases, integrating dynamic context based on task evolution, or combining this approach with multimodal understanding of documentation and architecture diagrams.
- DeepDiscovery uses a two-stage Location-Inference framework to recover task-relevant context from industrial codebases more effectively than local-fragment retrieval methods.
- Real-world testing on production-scale repositories showed 1.6-9.2 percentage point improvements in full recall rate across multiple AI coding systems.
- The method achieved a 78.6% solve rate on SWE-bench Verified, an 8.2 percentage point improvement over comparable baselines.
- DeepDiscovery operates without requiring offline preprocessing, making it practical for deployment in enterprise environments.
- Enhanced repository understanding directly improves coding-agent performance on complex software engineering tasks requiring multi-file context.