LARK: Learnability-Grounded Trajectory Selection for Efficient Reasoning Distillation
LARK introduces a learnability-grounded approach to trajectory selection for reasoning distillation, enabling student models to learn more efficiently from teacher-generated reasoning paths. The method uses a learnability factor to identify trajectories that maximize learning speed while maintaining distributional coverage, outperforming existing heuristic-based selection methods across multiple reasoning tasks.
LARK addresses a fundamental inefficiency in reasoning distillation pipelines where not all teacher-generated trajectories contribute equally to student model development. Traditional selection methods rely on surface-level metrics like trajectory quality or model confidence scores, which fail to account for whether a student model can actually learn from a given example efficiently. This research introduces a principled alternative grounded in learning theory.
The core innovation centers on a learnability factor that measures how quickly a student's training loss decreases on specific trajectories. Rather than treating all high-quality trajectories equally, LARK recognizes that some examples align better with a student's current learning capability. This represents a shift from one-size-fits-all data selection toward adaptive, learner-centric approaches.
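To make the idea of a loss-decrease-based score concrete, here is a minimal sketch. It is not LARK's estimator, which this summary does not specify. It assumes a hypothetical proxy: record each trajectory's student loss before and after a few probe fine-tuning steps and use the average per-step decrease as the score. The function name `learnability_proxy` and the sample loss values are illustrative only.

```python
import numpy as np

def learnability_proxy(loss_before, loss_after, num_steps=1):
    """Average per-step loss decrease per trajectory.

    Hypothetical proxy for a learnability score; the estimator used in
    LARK is not given in the text and may differ.
    """
    loss_before = np.asarray(loss_before, dtype=float)
    loss_after = np.asarray(loss_after, dtype=float)
    return (loss_before - loss_after) / num_steps

# Made-up student losses on three teacher trajectories,
# measured before and after two probe steps.
before = [2.0, 1.5, 3.0]
after = [1.0, 1.4, 2.9]

scores = learnability_proxy(before, after, num_steps=2)
ranking = np.argsort(-scores)  # indices sorted from most to least learnable
```

Under this proxy, the third trajectory has the highest loss but ranks no better than the second, because the student barely improves on it. That is the distinction between learnability and difficulty or quality that the paragraph above draws.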
For practitioners developing reasoning-based AI systems, this has practical implications. LARK's χ²-regularized selection policy balances immediate learnability gains against maintaining distributional diversity, preventing the model from converging on narrow solution patterns. The theoretical guarantees on estimation error provide confidence that the method generalizes reliably across different base models and task domains.
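The trade-off between learnability and coverage can be illustrated with a generic chi-squared (χ²) divergence-regularized reweighting. This is a sketch under stated assumptions, not LARK's actual policy: the summary does not give the objective, so the code assumes selection weights q that maximize the expected score minus `lam` times the χ² divergence from a reference distribution p over the teacher pool. The names `chi2_regularized_weights`, `ref_probs`, and `lam` are hypothetical, and `ref_probs` is assumed to be strictly positive.

```python
import numpy as np

def chi2_regularized_weights(scores, ref_probs, lam, iters=100):
    """Maximize sum(q * s) - lam * sum((q - p)**2 / p) over the simplex.

    Assumed objective for illustration only. The optimum has the form
    q_i = p_i * max(0, 1 + (s_i - tau) / (2 * lam)), with the threshold
    tau chosen so that q sums to one; tau is found by bisection.
    """
    s = np.asarray(scores, dtype=float)
    p = np.asarray(ref_probs, dtype=float)

    def weights(tau):
        return p * np.maximum(0.0, 1.0 + (s - tau) / (2.0 * lam))

    # The total mass is non-increasing in tau, is >= 1 at min(s)
    # and <= 1 at max(s), so the root lies in this interval.
    lo, hi = s.min(), s.max()
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if weights(mid).sum() > 1.0:
            lo = mid
        else:
            hi = mid
    q = weights(0.5 * (lo + hi))
    return q / q.sum()  # remove residual bisection error

scores = np.array([0.5, 0.05, 0.05])  # illustrative learnability scores
uniform = np.full(3, 1.0 / 3.0)       # assumed reference distribution

q_strong = chi2_regularized_weights(scores, uniform, lam=100.0)
q_weak = chi2_regularized_weights(scores, uniform, lam=0.01)
```

With a large `lam` the weights stay close to the reference distribution, preserving coverage; with a small `lam` nearly all mass moves to the highest-scoring trajectory, which is the narrow-pattern collapse the regularizer is meant to prevent.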
The empirical results consistently demonstrate faster supervised fine-tuning loss reduction with LARK-selected trajectories, suggesting meaningful computational savings in large-scale model training pipelines. As reasoning capabilities become increasingly important for foundation models, efficient distillation methods that reduce training overhead while preserving performance become strategically valuable for organizations developing language models.
- LARK selects training trajectories based on student learnability rather than trajectory quality alone, improving learning efficiency.
- A learnability factor quantifies how quickly a student model's loss decreases on specific examples.
- A χ²-regularized selection policy balances learning speed with distributional coverage to prevent overfitting to narrow solution patterns.
- The method demonstrates consistent improvements over baseline approaches across multiple base models and reasoning tasks.
- Theoretical guarantees on estimation error support reliable generalization across different model architectures and domains.