
Skill-SD: Skill-Conditioned Self-Distillation for Multi-turn LLM Agents

arXiv – CS AI | Hao Wang, Guozhi Wang, Han Xiao, Yufeng Zhou, Yue Pan, Jichao Wang, Ke Xu, Yafei Wen, Xiaohu Ruan, Xiaoxin Chen, Honggang Qi

AI Summary

Researchers introduce Skill-SD, a novel training framework for multi-turn LLM agents that improves sample efficiency by converting successful agent trajectories into dynamic natural language skills that condition a teacher model. The approach combines reinforcement learning with self-distillation and achieves significant performance improvements over baseline methods on benchmark tasks.

Analysis

Skill-SD addresses a fundamental challenge in training large language model agents: the inefficiency of reinforcement learning when rewards are sparse and task horizons are long. Traditional on-policy self-distillation provides dense supervision through privileged information, but static ground-truth answers cannot represent the variety of valid strategies agents might employ in complex, multi-step tasks. This research directly tackles training instability by introducing a dynamic supervision mechanism grounded in the agent's own successful experiences.

The framework's innovation lies in its abstraction layer: completed trajectories become compact natural language skill descriptions that capture both successful behaviors and failure patterns. These skills condition only the teacher model during training, while student models learn without explicit skill prompts, creating a knowledge transfer mechanism that feels more organic than traditional distillation. The importance-weighted reverse-KL loss ensures mathematically sound gradient computation at the token level, and dynamic teacher synchronization prevents divergence between teacher and student during learning.
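The importance-weighted reverse-KL objective described above can be sketched in plain Python. This is an illustrative reconstruction, not the authors' implementation: the clipping threshold, the exact importance ratio, and the per-token averaging are assumptions made for the sketch.

```python
import math

def reverse_kl(p_student, p_teacher, eps=1e-12):
    """Reverse KL divergence KL(student || teacher) for one token's
    vocabulary distribution: sum_v p_s(v) * log(p_s(v) / p_t(v))."""
    return sum(ps * math.log((ps + eps) / (pt + eps))
               for ps, pt in zip(p_student, p_teacher))

def importance_weighted_rkl_loss(student_dists, teacher_dists,
                                 behavior_probs, current_probs, clip=5.0):
    """Hypothetical token-level loss: reverse KL against the
    (skill-conditioned) teacher, with each token weighted by a clipped
    importance ratio between the current student policy and the behavior
    policy that originally sampled the trajectory. The clip value is an
    assumption; the paper's exact weighting scheme may differ."""
    total = 0.0
    for p_s, p_t, pb, pc in zip(student_dists, teacher_dists,
                                behavior_probs, current_probs):
        w = min(pc / max(pb, 1e-12), clip)  # clipped importance weight
        total += w * reverse_kl(p_s, p_t)
    return total / len(student_dists)
```

Note that the loss vanishes when the student already matches the teacher at every token, and the clipped weight bounds the gradient contribution of tokens where the student has drifted far from the behavior policy, which is one common way to keep off-policy corrections stable.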

The experimental results demonstrate substantial practical improvements: gains of 14.0% and 10.9% over vanilla GRPO, and of 42.1% and 40.6% over standard on-policy distillation, across the AppWorld and Sokoban benchmarks indicate the framework's effectiveness across different task types. These gains suggest that capturing learned strategies as intermediate representations significantly improves agent generalization and sample efficiency. The approach has implications for deployed agentic systems requiring faster, more reliable training with limited computational budgets.

Future developments will likely focus on scaling Skill-SD to more complex real-world tasks and integrating it with emerging multi-agent frameworks. The method's reliance on natural language summaries raises questions about how skill descriptions scale with task complexity and whether automated skill extraction can maintain quality at enterprise scales.

Key Takeaways
  • Skill-SD converts agent trajectories into dynamic natural language skills that improve training efficiency without requiring static privileged information.
  • The framework achieves 14-42% performance improvements over standard RL and on-policy distillation baselines on multi-turn task benchmarks.
  • Importance-weighted reverse-KL loss provides mathematically sound token-level supervision and prevents training collapse during RL and distillation integration.
  • Dynamic teacher synchronization prevents divergence between teacher and student models, stabilizing training across long-horizon interactive tasks.
  • The approach enables agents to internalize diverse valid strategies by learning from their own successful trajectories rather than fixed ground-truth answers.
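The asymmetry in the takeaways above — skills condition only the teacher, while the teacher is periodically pulled back toward the student — can be sketched as follows. All function names and the EMA-style update rule are assumptions for illustration; the paper's actual prompt format and synchronization schedule may differ.

```python
def teacher_prompt(task, skills):
    """Hypothetical skill-conditioned teacher prompt: only the teacher
    sees the natural-language skills distilled from prior successful
    (and failed) trajectories."""
    skill_block = "\n".join(f"- {s}" for s in skills)
    return (f"Skills learned from prior episodes:\n{skill_block}\n\n"
            f"Task: {task}")

def student_prompt(task):
    """The student receives the bare task with no skill hints, so it
    must internalize the strategies rather than read them at test time."""
    return f"Task: {task}"

def sync_teacher(teacher_w, student_w, tau=0.05):
    """Illustrative EMA-style synchronization that keeps the teacher
    close to the evolving student, preventing teacher-student divergence
    over long-horizon training."""
    return [(1 - tau) * t + tau * s for t, s in zip(teacher_w, student_w)]
```

Because the skill text appears only in the teacher's prompt, distillation transfers the strategy into the student's weights; at deployment the student needs no skill prompt at all.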