Trajectory Supervision for Continual Tool-Use Learning in LLMs
Researchers demonstrate that preserving API request/response trajectories during continual learning significantly improves tool-use performance in language models. Fine-tuning Llama 3.1 8B on sequential API domains shows trajectory supervision achieves 56.9% accuracy versus 39.2% without intermediate context, though at a 25.1% token cost increase.
This research addresses a fundamental challenge in training language models for practical tool use: whether exposing intermediate steps during learning improves downstream performance. The study leverages API-Bank datasets across four sequential domain blocks, comparing two training conditions: one strips the API call history from each example, while the other preserves the full request/response trajectory as context. The 17.7 percentage point accuracy gap between conditions is substantial enough to warrant deeper investigation into trajectory-based supervision as a training paradigm.
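The contrast between the two conditions can be made concrete with a small sketch. The code below is illustrative only, not the study's actual pipeline: it shows one plausible way to serialize an API-Bank-style dialogue into a next-call prediction example, either keeping or dropping the intermediate API turns. All field names, role labels, and the example dialogue are hypothetical.

```python
# Hypothetical sketch of the two training conditions: serializing a
# dialogue into a prompt/target pair for next-call prediction, with
# intermediate API turns either preserved or stripped.

def build_example(turns, preserve_trajectory):
    """Turn a list of dialogue turns into (prompt, target).

    `turns` is a list of dicts with a 'role' key ('user',
    'api_call', or 'api_response') and a 'text' key; the final
    turn is the API call the model must predict.
    """
    history = []
    for turn in turns[:-1]:
        if turn["role"] in ("api_call", "api_response") and not preserve_trajectory:
            continue  # stripped condition: drop intermediate API context
        history.append(f'{turn["role"]}: {turn["text"]}')
    prompt = "\n".join(history)
    target = turns[-1]["text"]  # the next API call to predict
    return prompt, target

# Hypothetical dialogue with one intermediate call/response pair.
turns = [
    {"role": "user", "text": "Book a table for two at 7pm."},
    {"role": "api_call", "text": "SearchRestaurants(time='19:00', party=2)"},
    {"role": "api_response", "text": "[{'name': 'Bistro 5', 'id': 17}]"},
    {"role": "api_call", "text": "BookTable(restaurant_id=17, time='19:00', party=2)"},
]

with_traj, target = build_example(turns, preserve_trajectory=True)
stripped, _ = build_example(turns, preserve_trajectory=False)
# `with_traj` includes the SearchRestaurants call and its response;
# `stripped` contains only the user turn. Both share the same target.
```

Under this framing, the trajectory condition supervises the model with the full chain of intermediate calls in context, while the stripped condition asks it to predict the same target call from the user request alone, which is where the extra training tokens come from.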
The work connects to broader trends in LLM fine-tuning and reasoning tasks. Recent advances in chain-of-thought prompting and step-by-step reasoning demonstrate that models benefit from exposure to process, not just outcomes. This research quantifies that principle specifically for tool use, where API interactions require sequential decision-making. The API-Bank domain structure provides a controlled environment to study continual learning without catastrophic forgetting, relevant to production scenarios where models must adapt to new tools incrementally.
For AI developers and practitioners building tool-augmented systems, the results suggest that trajectory supervision could improve agent reliability and reduce costly API errors. However, the 25% token overhead and single-seed limitation temper these implications. Organizations deploying continual learning systems face a practical tradeoff between accuracy and computational efficiency. The focus on next-call prediction rather than full dialogue success leaves open questions about whether trajectory benefits persist in realistic multi-turn interactions where earlier mistakes compound. Future work examining multi-seed validation and end-to-end dialogue performance will determine whether this approach scales to production deployment.
- Preserving API trajectories during training improves exact call accuracy by 17.7 percentage points compared to stripped context.
- Trajectory supervision increases training token consumption by 25.1%, creating efficiency-accuracy tradeoffs for deployment.
- Results are limited to a single-seed pilot with next-call prediction; full dialogue and multi-seed validation are needed for broader claims.
- Tool-use continual learning benefits from intermediate step exposure, aligning with chain-of-thought reasoning research.
- API-based systems training could benefit from trajectory-aware fine-tuning to reduce errors in sequential decision-making tasks.