Steer, Don't Solve: Training Small Critic Models for Large Code Agents
Researchers developed a small critic model that guides large code agents during execution rather than evaluating completed work, reducing computational costs while improving performance. The approach achieves 25.2% accuracy on SWE-bench Verified at 64% lower expense than larger agents, demonstrating that supplementing agent training with efficient feedback mechanisms outperforms scaling alone.
The research addresses a fundamental inefficiency in code agent development: end-to-end training conflates two distinct optimization problems. Large agents struggle to balance code-level execution with strategy-level reasoning, and joint optimization leaves higher-order decision-making underdeveloped. By freezing the agent and introducing a small critic model that provides real-time feedback during trajectory execution, the team decouples these concerns and achieves superior results with dramatically reduced overhead.
This work builds on growing recognition that bigger models aren't always better. The AI research community has increasingly explored how smaller, specialized models can enhance larger systems through targeted feedback loops rather than raw scale. Traditional post-hoc critics score completed attempts, missing opportunities to steer agents away from unproductive paths mid-trajectory. The authors' intra-trajectory approach mirrors human problem-solving, where intermediate course-corrections prevent wasted effort.
The practical implications are significant for both AI labs and commercial applications. A critic-guided Qwen3-Next-80B agent costs $0.04 per inference compared to $0.11 for the unguided version while achieving superior accuracy—a compelling value proposition. The critic transfers effectively across unseen agents, suggesting broad applicability. Companies deploying code agents face mounting operational costs; this technique offers tangible savings without sacrificing quality.
The open-source release and model availability enable rapid adoption and validation. Future work likely explores scaling critic models intelligently, adapting critics across different agent architectures, and applying similar steering principles to other complex reasoning tasks. This research exemplifies how thoughtful architectural choices and efficient training strategies can compete with brute-force scaling in specialized domains.
- →Small critic models steer agents during execution, achieving 25.2% accuracy at 64% lower cost than larger unguided agents on code tasks.
- →Intra-trajectory feedback via supervised fine-tuning outperforms post-hoc trajectory scoring by enabling mid-course corrections.
- →Critics trained on one agent's trajectories transfer effectively to unseen agents with gains of +3.0 to +5.2 percentage points.
- →The critic-guided approach reduces inference costs from $0.11 to $0.04 while simultaneously improving accuracy and shortening trajectories.
- →Open-source release enables rapid adoption across diverse code agent architectures and use cases.