
Self-Evolution for Multi-Turn Tool-Calling Agents via Divergence-Point Preference Learning

arXiv – CS AI | Jiaqiang Tang
AI Summary

Researchers present ToolGraph, a framework that improves multi-turn tool-using AI agents through self-evolution via preference learning. By combining schema-derived tool topology with divergence-point preference optimization, the system achieves a 16.8% improvement over the baseline on benchmark tasks, with gains concentrated in the airline and retail domains.

Analysis

ToolGraph addresses a fundamental challenge in autonomous agent design: coordinating complex, multi-step tool sequences while maintaining coherent dialogue state and respecting operational constraints. Traditional approaches treat inference-time orchestration and parameter-level learning as separate processes, creating misalignment between training objectives and deployment realities. This research bridges that gap by integrating graph-based tool topology directly into the learning pipeline.

The technical contribution centers on identifying divergence points, the critical decision moments where agent behavior branches toward success or failure, and constructing preference pairs around them. Because direct preference optimization (DPO) is trained under the same ToolGraph context used during inference, the method eliminates the prompt mismatch between training and deployment that affected earlier approaches. The 11.2% improvement from ToolGraph alone indicates that better tool orchestration architecture matters; the additional 5.6% gain from DPO shows that aligned preference learning builds on it, and together they account for the 16.8% total.
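To make the idea of a divergence point concrete, here is a minimal sketch of how a preference pair might be built from one successful and one failed trajectory for the same task. It is an illustration under stated assumptions, not the paper's implementation: the `Step` type, the function names, the tool names, and the rule "the divergence point is the first step where the two trajectories differ" are all hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class Step:
    """One tool call in a trajectory (hypothetical structure)."""
    tool: str
    args: str = ""


def first_divergence(success: List[Step], failure: List[Step]) -> Optional[int]:
    """Return the first index where the two trajectories differ.

    Returns None when one trajectory is a prefix of the other, because
    there is then no step at which both made a different choice.
    """
    for i, (good, bad) in enumerate(zip(success, failure)):
        if good != bad:
            return i
    return None


def build_preference_pair(
    context: str, success: List[Step], failure: List[Step]
) -> Optional[Dict[str, object]]:
    """Build a DPO-style record around the divergence point.

    `context` stands in for the tool-graph prompt; the sketch assumes the
    same string is used at training and inference time.
    """
    i = first_divergence(success, failure)
    if i is None:
        return None
    return {
        "context": context,
        "history": success[:i],   # shared prefix before the branch
        "chosen": success[i],     # step taken by the successful run
        "rejected": failure[i],   # step taken by the failed run
    }


# Hypothetical airline-style tool names, for illustration only.
good_run = [Step("get_user"), Step("get_reservation"), Step("cancel_reservation")]
bad_run = [Step("get_user"), Step("cancel_reservation")]
pair = build_preference_pair("<tool graph context>", good_run, bad_run)
```

In this toy case the two runs agree on the first call and branch at index 1, so the pair contrasts `get_reservation` (chosen) with a premature `cancel_reservation` (rejected) given the same shared history.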

For the AI agent ecosystem, this work signals maturation in how researchers approach multi-step reasoning. Rather than treating tool selection as a post-hoc wrapper around language models, ToolGraph embeds tool semantics into the learning objective itself. The concentrated improvements in airline and retail domains suggest domain-specific tool graphs yield outsized benefits, implying that specialized agent deployment could see meaningful capability jumps with similar techniques.

The diagnostic finding that half of telecom trajectories exhaust their step budgets before execution suggests that infrastructure constraints such as step limits matter as much as the learning algorithm. Future work will likely focus on scaling the approach across domains while improving computational efficiency, particularly for cost-sensitive applications.

Key Takeaways
  • ToolGraph achieves 16.8% improvement on benchmark tasks by integrating tool topology into preference learning rather than treating it as separate inference machinery.
  • Divergence-point identification enables precise preference pair construction, eliminating train-deployment prompt mismatch problems in prior multi-turn agent approaches.
  • Direct preference optimization gains concentrate in airline and retail domains, suggesting domain-specific tool graphs unlock larger capability improvements than general approaches.
  • Diagnostic analysis shows that roughly half of telecom trajectories exhaust their step budget before execution, which points to step-budget limits rather than the learning algorithm as the bottleneck and makes infrastructure optimization a critical next step.
  • Chosen reward positivity emerges as the most reliable checkpoint signal across 16 DPO configurations, providing practitioners a concrete metric for model selection during training.
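
The last takeaway can be illustrated with a short sketch. In DPO, the implicit reward of a response is the scaled log-probability ratio between the policy and the reference model, beta * (log pi(y|x) - log pi_ref(y|x)). The code below assumes that "chosen reward positivity" means the mean implicit reward on chosen responses is above zero; that reading, the function names, the checkpoint names, and the numbers are assumptions for illustration, not details taken from the paper.

```python
from typing import Dict, List, Tuple


def mean_chosen_reward(
    policy_logps: List[float], ref_logps: List[float], beta: float = 0.1
) -> float:
    """Mean DPO implicit reward over chosen responses.

    Each reward is beta * (policy log-prob - reference log-prob).
    """
    if not policy_logps or len(policy_logps) != len(ref_logps):
        raise ValueError("need two non-empty lists of equal length")
    total = sum(p - r for p, r in zip(policy_logps, ref_logps))
    return beta * total / len(policy_logps)


def positive_checkpoints(
    stats: Dict[str, Tuple[List[float], List[float]]], beta: float = 0.1
) -> List[str]:
    """Keep checkpoints whose mean chosen reward is positive (assumed criterion)."""
    return [
        name
        for name, (policy_logps, ref_logps) in stats.items()
        if mean_chosen_reward(policy_logps, ref_logps, beta) > 0
    ]


# Made-up log-probabilities of chosen responses under policy and reference.
stats = {
    "step_100": ([-10.0, -12.0], [-11.0, -12.5]),  # policy raised chosen likelihood
    "step_200": ([-14.0, -13.0], [-11.0, -12.5]),  # policy lowered chosen likelihood
}
kept = positive_checkpoints(stats)
```

A positive value means the policy assigns the chosen responses a higher likelihood than the reference model does, which is why it is a plausible sanity signal when comparing checkpoints.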