
Pull Requests as a Training Signal for Repo-Level Code Editing

arXiv – CS AI | Qinglin Zhu, Tianyu Chen, Shuai Lu, Lei Ji, Runcong Zhao, Murong Ma, Xiangxiang Dai, Yulan He, Lin Gui, Peng Cheng, Yeyun Gong | June 1, 2026 at 04:00 AM

🤖AI Summary

Researchers introduce Clean-PR, a training methodology that leverages 2 million real-world GitHub pull requests to improve AI models' ability to perform repository-level code editing. The approach achieves significant performance gains on SWE-bench benchmarks without relying on complex agent scaffolding, demonstrating that code editing capabilities can be effectively internalized into model weights through high-quality training signals.

Analysis

The research addresses a central challenge in AI-assisted software development: enabling models to understand complex codebases and make precise multi-file modifications. Existing approaches often rely on sophisticated agent scaffolding at inference time, which adds computational overhead and complexity. Clean-PR instead internalizes these capabilities in the model weights through mid-training, so the model learns the editing behavior during training rather than being steered toward it at inference time.

This work builds on the broader trend of using real-world data as a training signal for specialized tasks. Converting noisy GitHub pull requests into structured Search/Replace edit blocks yields what is described as the largest publicly available corpus of its kind: 2 million PRs across 12 programming languages. Because the dataset reflects genuine developer intent and real-world code patterns, the premise is that it offers a stronger training signal than synthetic or annotated data.
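To make the idea of a Search/Replace edit block concrete, here is a minimal sketch of how a before/after pair of file contents could be turned into such blocks and then checked by re-applying them. It is an illustration only, not the Clean-PR pipeline: the function names (`make_edit_blocks`, `apply_edit_blocks`, `render_block`), the `<<<<<<< SEARCH` / `>>>>>>> REPLACE` markers, and the exactly-one-match rule are all assumptions chosen for the example.

```python
import difflib


def make_edit_blocks(before: str, after: str, context: int = 1):
    """Return (search, replace) pairs that turn `before` into `after`.

    Hypothetical helper: each pair covers one run of changed lines plus
    `context` unchanged lines on either side.
    """
    a = before.splitlines(keepends=True)
    b = after.splitlines(keepends=True)
    matcher = difflib.SequenceMatcher(a=a, b=b, autojunk=False)
    blocks = []
    for group in matcher.get_grouped_opcodes(context):
        # Opcodes are (tag, i1, i2, j1, j2); take the full span of the group.
        i1, i2 = group[0][1], group[-1][2]
        j1, j2 = group[0][3], group[-1][4]
        blocks.append(("".join(a[i1:i2]), "".join(b[j1:j2])))
    return blocks


def apply_edit_blocks(source: str, blocks) -> str:
    """Apply each block in order, rejecting ambiguous or missing matches."""
    for search, replace in blocks:
        # Assumed validation rule: the search text must occur exactly once.
        if source.count(search) != 1:
            raise ValueError("search text must match exactly once")
        source = source.replace(search, replace, 1)
    return source


def render_block(path: str, search: str, replace: str) -> str:
    """Serialize one block as text (marker strings are an assumed convention)."""
    return f"{path}\n<<<<<<< SEARCH\n{search}=======\n{replace}>>>>>>> REPLACE\n"


if __name__ == "__main__":
    before = "def area(w, h):\n    return w + h\n\nprint(area(2, 3))\n"
    after = "def area(w, h):\n    return w * h\n\nprint(area(2, 3))\n"
    blocks = make_edit_blocks(before, after)
    # Round-trip check: a sample is kept only if its blocks reproduce `after`.
    assert apply_edit_blocks(before, blocks) == after
    for search, replace in blocks:
        print(render_block("geometry.py", search, replace))
```

The round-trip check at the end is the part that matters for data quality in a sketch like this: a pair whose blocks fail to reproduce the post-change file, or whose search text is ambiguous, would simply be discarded rather than used for training.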

The reported improvements over instruction-tuned baselines, 13.6% on SWE-bench Lite and 12.3% on SWE-bench Verified, are substantial. If they hold up in practice, the approach could reduce the infrastructure needed to deploy AI code editing tools while improving their reliability. For developers, this could mean faster, more efficient AI coding assistants. For organizations building code AI systems, the methodology offers a scalable path to competitive performance without engineering-intensive agent systems.

The agentless-aligned supervision approach, combined with error-driven data augmentation, points toward simpler, more efficient AI systems that rely on better training data rather than complex runtime orchestration. Future work may explore similar training paradigms for other complex tasks that require multi-step reasoning and precise execution.

Key Takeaways

→ The Clean-PR methodology converts 2 million GitHub pull requests into structured training data for repository-level code editing.
→ Models trained with this approach outperform instruction-tuned baselines by 13.6% on SWE-bench Lite and 12.3% on SWE-bench Verified without requiring complex agent scaffolding.
→ The approach internalizes code editing capabilities into model weights through mid-training rather than relying on inference-time complexity.
→ Real-world pull request data is presented as a higher-quality training signal than synthetic alternatives for code understanding tasks.
→ Simpler, weight-based models may eventually replace complex agent systems for code editing applications.