Schedule-and-Calibrate: Utility-Guided Multi-Task Reinforcement Learning for Code LLMs
Researchers introduce ASTOR, a multi-task reinforcement learning framework that trains a single code LLM across multiple coding tasks more efficiently than task-specific models. By dynamically prioritizing training data and adjusting optimization constraints based on task utility, ASTOR achieves 9.0-9.5% performance gains over specialized models and 7.5-12.8% improvements over existing multi-task approaches.
ASTOR addresses a fundamental efficiency problem in deploying code LLMs: the need to maintain separate specialized models for different coding tasks. The framework's innovation centers on task utility—a metric that captures both individual task learning potential and synergies between tasks. This enables intelligent resource allocation that standard multi-task approaches miss.
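The paper's exact formulation of task utility is not reproduced here, but a signal of this kind can be sketched as a weighted blend of a task's own learning progress and its measured synergy with the other tasks. All names, weights, and the transfer table below are illustrative assumptions, not ASTOR's published definition:

```python
def learning_potential(reward_history):
    """Recent reward improvement as a crude proxy for how much the
    task can still be learned (larger recent gains => higher potential)."""
    if len(reward_history) < 2:
        return 1.0  # unexplored tasks start with maximal potential
    return max(0.0, reward_history[-1] - reward_history[-2])

def task_utility(task, reward_histories, transfer, alpha=0.7):
    """Blend a task's own learning potential with the average transfer
    benefit it provides to the other tasks. `transfer[(a, b)]` is an
    assumed estimate of how much training on task a helps task b."""
    own = learning_potential(reward_histories[task])
    others = [t for t in reward_histories if t != task]
    synergy = sum(transfer[(task, t)] for t in others) / max(len(others), 1)
    return alpha * own + (1 - alpha) * synergy

# Hypothetical usage with three coding tasks:
histories = {"codegen": [0.20, 0.35], "repair": [0.50, 0.52], "translate": []}
transfer = {("codegen", "repair"): 0.3, ("codegen", "translate"): 0.1,
            ("repair", "codegen"): 0.2, ("repair", "translate"): 0.0,
            ("translate", "codegen"): 0.05, ("translate", "repair"): 0.05}
scores = {t: task_utility(t, histories, transfer) for t in histories}
```

Under a scheme like this, a task with rapid recent improvement or strong positive transfer would be allocated more training resources than one that has plateaued.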
The research builds on recent progress in RL-based LLM post-training, where verifiable rewards from code execution have proven highly effective. However, scaling this approach across multiple tasks traditionally requires either redundant model copies or crude averaging strategies that treat all tasks identically. ASTOR's two-module design addresses this gap: the hierarchical data scheduling module prioritizes which training examples to use, while the adaptive policy optimization module tailors optimization constraints to each task's current learning state.
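Under the same illustrative assumptions, the two modules might interact roughly as follows: sample training tasks in proportion to their current utility, and loosen each task's KL constraint as its utility rises so the policy can move further on high-value tasks. This is a minimal sketch of the idea, not the paper's algorithm; the function names and the `base / (1 + utility)` schedule are invented for illustration:

```python
import random

def schedule_batch(utilities, batch_size, rng=None):
    """Hierarchical-scheduling sketch: draw task labels for a training
    batch with probability proportional to each task's current utility
    (assumes non-negative utilities; falls back to uniform if all zero)."""
    rng = rng or random.Random(0)
    tasks = list(utilities)
    total = sum(utilities[t] for t in tasks)
    weights = [utilities[t] / total if total > 0 else 1.0 for t in tasks]
    return rng.choices(tasks, weights=weights, k=batch_size)

def adaptive_kl_coeff(utility, base=0.1, lo=0.01, hi=1.0):
    """Adaptive-optimization sketch: higher-utility tasks get a smaller
    KL coefficient (a looser constraint), clipped to [lo, hi]."""
    coeff = base / (1.0 + utility)
    return min(max(coeff, lo), hi)

# Hypothetical usage: a high-utility task dominates the batch and
# trains under a looser KL constraint than a plateaued one.
batch = schedule_batch({"codegen": 0.9, "repair": 0.1}, batch_size=100)
loose, tight = adaptive_kl_coeff(0.9), adaptive_kl_coeff(0.0)
```

The design choice worth noting is that both knobs read the same utility signal, so data allocation and optimization pressure stay consistent rather than being tuned independently per task.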
For practitioners deploying code LLMs in production, this work has significant implications. Unified models reduce computational overhead and memory requirements compared to maintaining task-specific specialists, while achieving superior performance. The 7.5-12.8% performance gap over existing multi-task baselines suggests substantial room for improvement in how training resources are allocated across diverse objectives.
The framework's reliance on task utility as a guiding signal opens avenues for extending this approach beyond coding tasks. Future work likely explores whether similar utility-driven coordination applies to broader language model applications, and how to automatically derive task utility metrics without manual specification.
- ASTOR unifies multi-task code LLM training by dynamically prioritizing data and adjusting per-task optimization based on task utility signals
- A single ASTOR model outperforms task-specific specialists by 9.0-9.5% and existing multi-task baselines by 7.5-12.8%
- Hierarchical data scheduling and adaptive KL regularization address the core limitation of uniform treatment across diverse coding tasks
- The framework reduces computational costs by eliminating the need for multiple specialized models while improving performance
- A task utility metric capturing learning potential and cross-task synergy enables more efficient resource allocation than fixed curricula