Beyond Trajectory Rewards: Step-level Credit Assignment for Agentic Search via Graph Modeling
Researchers introduce Graph-Distance Contribution Reward (GDCR), a step-level credit assignment method for agentic search that scores individual agent actions by how much progress they make toward answer nodes in a knowledge graph. Combined with Step Advantage Policy Optimization (SAPO), the approach improves on trajectory-level reward schemes, which cannot assess the quality of intermediate steps, and reports strong results across four benchmarks.
This research addresses a fundamental challenge in training AI agents: determining which individual actions within a task sequence deserve credit for eventual success. Traditional agentic search systems assign rewards only at the trajectory level, so an agent learns whether it succeeded overall but not which specific steps were most valuable. The resulting learning signal is inefficient, which is particularly problematic for complex reasoning tasks.
The proposed method models world knowledge as a latent graph in which entities are nodes and relations are edges. By measuring how much closer newly retrieved or newly cited entities bring the agent to the correct answer node, the system generates fine-grained credit signals without expensive tree-based sampling. This is a meaningful efficiency gain for reinforcement learning of AI agents: it reduces computational overhead while maintaining signal quality.
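To make the graph-distance idea concrete, here is a minimal sketch on a toy graph. It is not the paper's formulation: the function names (`shortest_distance`, `step_rewards`), the `unreachable` penalty, and the choice to reward only a reduction in the best distance seen so far are all assumptions made for illustration.

```python
from collections import deque

def shortest_distance(graph, source, target):
    """Breadth-first search hop count from source to target; None if unreachable."""
    if source == target:
        return 0
    seen = {source}
    queue = deque([(source, 0)])
    while queue:
        node, dist = queue.popleft()
        for neighbor in graph.get(node, ()):
            if neighbor == target:
                return dist + 1
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append((neighbor, dist + 1))
    return None

def step_rewards(graph, answer, start_entities, entities_per_step, unreachable=10):
    """Hypothetical step reward: reduction in the best-known distance to the answer."""
    def dist(entity):
        d = shortest_distance(graph, entity, answer)
        return unreachable if d is None else d

    best = min((dist(e) for e in start_entities), default=unreachable)
    seen = set(start_entities)
    rewards = []
    for entities in entities_per_step:
        # Only entities that appear for the first time can earn credit.
        new = [e for e in entities if e not in seen]
        seen.update(new)
        step_best = min((dist(e) for e in new), default=best)
        rewards.append(max(0, best - step_best))
        best = min(best, step_best)
    return rewards

# Toy undirected graph for "In which country was the director of Inception born?"
graph = {
    "Inception": ["Christopher Nolan"],
    "Christopher Nolan": ["Inception", "London", "Oppenheimer"],
    "London": ["Christopher Nolan", "United Kingdom"],
    "United Kingdom": ["London"],
    "Oppenheimer": ["Christopher Nolan"],
}
rewards = step_rewards(
    graph,
    answer="United Kingdom",
    start_entities=["Inception"],
    entities_per_step=[["Christopher Nolan"], ["Oppenheimer"], ["London", "United Kingdom"]],
)
print(rewards)  # [1, 0, 2]
```

In this toy run the second step, which surfaces an entity no closer to the answer, earns zero credit, while the steps that shorten the path are rewarded. A trajectory-level reward would have scored all three steps identically.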
The SAPO framework combines step-level and trajectory-level advantages, so agents learn from both the quality of each immediate action and the final outcome. This hybrid approach addresses a key limitation in current agentic systems: they either lack granular feedback for intermediate steps or require prohibitive computational resources to generate it.
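The summary does not say how SAPO merges the two signals, so the following is only a sketch of one plausible combination: each step's advantage is its trajectory's normalized advantage plus a weighted, normalized step-level term. The weighted sum, the `step_weight` parameter, and the mean-and-standard-deviation normalization are assumptions for illustration, not the paper's definition. The sketch also assumes at least one trajectory and at least one step overall.

```python
from statistics import mean, pstdev

def normalize(values, eps=1e-8):
    """Center values on their mean and scale by their population standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)
    return [(v - mu) / (sigma + eps) for v in values]

def combined_advantages(trajectory_rewards, step_rewards, step_weight=0.5):
    """Hypothetical mix: trajectory advantage + step_weight * step advantage."""
    traj_adv = normalize(trajectory_rewards)
    # Normalize step rewards jointly across every step of every trajectory.
    flat = [r for steps in step_rewards for r in steps]
    flat_adv = iter(normalize(flat))
    combined = []
    for adv, steps in zip(traj_adv, step_rewards):
        combined.append([adv + step_weight * next(flat_adv) for _ in steps])
    return combined

# Two sampled trajectories for the same question: one succeeds, one fails.
trajectory_rewards = [1.0, 0.0]
step_rewards = [[1, 0, 2], [0, 0]]  # e.g. graph-distance rewards per step
advantages = combined_advantages(trajectory_rewards, step_rewards)
```

With these inputs, the steps of the successful trajectory all receive positive advantages, but the step that made no graph progress receives the smallest of the three. That ordering is the distinction a purely trajectory-level advantage cannot express.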
For the AI development community, this work has tangible implications. More efficient credit assignment accelerates the training of reasoning-based agents used in search, information retrieval, and question-answering systems. The methodology could extend to other domains requiring multi-step decision-making. Validation across four challenging benchmarks suggests practical applicability rather than theoretical elegance alone.
- GDCR enables step-level credit assignment by measuring entity distance to answer nodes in knowledge graphs, avoiding expensive tree sampling
- Step Advantage Policy Optimization combines step-level and trajectory-level advantages for more efficient agent training
- The approach reduces computational overhead while maintaining or improving signal quality in agentic search tasks
- The method is validated on four benchmarks, suggesting broad applicability to reasoning-based AI systems
- Graph-based modeling of world knowledge provides interpretable progress signals for individual agent actions