CAPF: Guiding Search-Agent Rollouts with Credit-Attenuated Privileged Feedback
Researchers propose Credit-Attenuated Privileged Feedback (CAPF), a training mechanism that guides LLM search agents by providing verifier feedback during training to improve learning on difficult problems. The approach improves performance on open-domain QA benchmarks by leveraging information already available in reinforcement learning systems, increasing exact-match accuracy from 44.7% to 48.5% on Qwen3-4B.
CAPF addresses a fundamental challenge in training AI agents: the sparse reward problem. When LLM search agents attempt complex reasoning tasks, they rarely generate successful end-to-end solutions naturally, leaving reinforcement learning systems with insufficient positive examples to learn from effectively. By making verifier information available during training, CAPF enables agents to understand where their reasoning failed and attempt repairs, transforming zero-reward trajectories into learning opportunities.
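To make the sparse reward problem concrete, here is a back-of-the-envelope sketch. It assumes each rollout succeeds independently with the same probability, a simplification not taken from the CAPF work; the function name and the numbers are illustrative only.

```python
def prob_any_success(p_success: float, n_rollouts: int) -> float:
    """Chance that at least one of n independent rollouts earns a reward.

    Assumption (not from the source): rollouts are independent and share
    the same per-rollout success probability p_success.
    """
    if not 0.0 <= p_success <= 1.0:
        raise ValueError("p_success must be in [0, 1]")
    if n_rollouts < 0:
        raise ValueError("n_rollouts must be non-negative")
    return 1.0 - (1.0 - p_success) ** n_rollouts


if __name__ == "__main__":
    # Hypothetical hard question: 1% success rate, 8 rollouts sampled.
    print(f"{prob_any_success(0.01, 8):.3f}")
```

Under these assumptions, a question the agent solves 1% of the time yields a rewarded rollout in fewer than one in ten groups of eight, so most sampled groups carry no positive example at all.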
This research builds on recent advances in LLM reasoning and reinforcement learning from verifiable rewards, a paradigm gaining traction as AI systems tackle increasingly complex tasks. The insight that systems already contain useful training signals that remain untapped represents an efficiency gain—CAPF doesn't require new infrastructure, just better utilization of existing verification mechanisms.
The practical impact centers on improving small models' reasoning capabilities. The 3.8 percentage-point improvement in exact-match accuracy on Qwen3-4B, measured across seven benchmarks, suggests that privileged feedback during training could help smaller, more deployable models close the gap with larger competitors on knowledge-intensive tasks. This matters for cost-sensitive applications and edge deployments where model size is tightly constrained.
The crucial detail is credit attenuation: the mechanism scales down the learning credit assigned to feedback-assisted behavior so the policy doesn't become dependent on information unavailable at deployment. This keeps the mismatch between training and inference from being baked into the policy, preventing the agent from learning to rely on assistance it won't receive in production. Future work could explore scaling this approach and applying it to other structured reasoning problems beyond QA.
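The summary does not spell out how the attenuation weights are computed, so the following is only a minimal sketch of the general idea, not the authors' method. It assumes per-step advantages are already available and that each step is flagged as occurring before or after privileged feedback was shown; `attenuate_credit`, `after_feedback`, and `alpha` are hypothetical names introduced for illustration.

```python
from typing import List


def attenuate_credit(
    advantages: List[float],
    after_feedback: List[bool],
    alpha: float = 0.5,
) -> List[float]:
    """Scale down the credit given to steps taken after privileged feedback.

    Hypothetical illustration: steps flagged True in after_feedback have
    their advantage multiplied by alpha (0 = ignore, 1 = full credit);
    unassisted steps keep their advantage unchanged.
    """
    if len(advantages) != len(after_feedback):
        raise ValueError("advantages and after_feedback must align")
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be in [0, 1]")
    return [
        adv * alpha if assisted else adv
        for adv, assisted in zip(advantages, after_feedback)
    ]


if __name__ == "__main__":
    # Three steps; the last two happened after verifier feedback was shown.
    print(attenuate_credit([1.0, 1.0, -2.0], [False, True, True], alpha=0.5))
```

With `alpha` below 1, a repaired trajectory still contributes a learning signal, but the assisted portion moves the policy less than behavior the agent produced on its own.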
- CAPF leverages existing verifier information to create learning opportunities from failed attempts during training
- Exact-match accuracy improved 3.8 points (44.7% to 48.5%) on open-domain QA benchmarks
- Credit attenuation prevents models from depending on training-only feedback signals
- The approach improves reasoning for smaller models, enabling more deployable AI systems
- No new infrastructure required: it uses information already present in reinforcement learning frameworks