When Context Returns: Toward Robust Internalization in On-Policy Distillation
Researchers identify a critical failure mode in on-policy distillation where reintroducing privileged context (like system prompts) to a distilled student model degrades performance, even on previously solved tasks. They propose a lightweight consistency regularizer using stop-gradient anchoring and forward KL divergence to achieve 'context removability,' enabling models to internalize context while remaining stable when it reappears.
This research addresses a fundamental challenge in knowledge distillation: the brittleness of internalized knowledge when input conditions change. Context-induced degradation shows that current distillation approaches optimize narrowly for no-context performance without accounting for robustness across varying input conditions. This matters because deployed AI systems frequently encounter context shifts, whether through user modifications, system updates, or prompt engineering, and models that fail to degrade gracefully under these conditions create liability and limit practical deployment.
The paper builds on established distillation literature but identifies an overlooked failure mode that explains why internalized knowledge sometimes appears unstable. Prior work focused on matching teacher behavior during distillation but ignored the stability properties needed when the original context returns or conditions otherwise change. The proposed regularizer combines a stop-gradient anchor with a forward KL divergence term to hold the model's outputs steady. It adds little computational overhead and reduces context-induced harm in 11 of the 12 configurations tested.
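The summary does not give the exact form of the loss, so the following is only a minimal numerical sketch of a forward KL consistency term with a fixed anchor. It assumes the anchor is the student's own no-context next-token distribution and that the student's with-context distribution is pulled toward it; that pairing, and the names `log_softmax`, `forward_kl`, and `consistency_penalty`, are illustrative assumptions rather than the paper's definitions.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(z - m) for z in logits))
    return [z - log_z for z in logits]


def forward_kl(anchor_logp, current_logp):
    """Forward KL: KL(anchor || current) = sum_i p_i * (log p_i - log q_i)."""
    return sum(math.exp(a) * (a - c) for a, c in zip(anchor_logp, current_logp))


def consistency_penalty(logits_no_context, logits_with_context):
    """Penalty that grows as the with-context output drifts from the anchor.

    ASSUMPTION: the anchor is the no-context distribution. Here it is just a
    plain list of numbers; in an autograd framework it would be wrapped in a
    stop-gradient (e.g. `tensor.detach()` in PyTorch or
    `jax.lax.stop_gradient` in JAX) so that only the with-context branch
    receives gradients.
    """
    anchor_logp = log_softmax(logits_no_context)
    current_logp = log_softmax(logits_with_context)
    return forward_kl(anchor_logp, current_logp)


if __name__ == "__main__":
    # Identical outputs with and without context incur no penalty.
    print(consistency_penalty([1.0, 2.0, 3.0], [1.0, 2.0, 3.0]))
    # A with-context output that drifts from the anchor is penalized.
    print(consistency_penalty([0.0, 0.0, 0.0], [2.0, 0.0, 0.0]))
```

Because the anchor is held fixed, minimizing this term moves only the with-context behavior, which is one plausible reading of how the regularizer could keep no-context performance intact. In training it would presumably be added to the distillation loss with a weighting coefficient, which the summary does not specify.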
For AI practitioners and model developers, this work has direct implications for production systems. Models that reliably internalize context while remaining stable when that context reappears are easier to deploy and maintain. The method is lightweight, requiring only one additional forward pass, so adoption adds little computational cost. The mechanistic analysis indicates that context removability operates at the representation level, which suggests the approach targets root causes rather than symptoms and may improve the generalization of distilled models.
Future research should explore whether these principles extend to other forms of knowledge transfer and whether similar consistency regularization helps with other distribution shifts in machine learning systems.
- Context-induced degradation causes distilled models to perform worse when original privileged context is reintroduced, even on previously solved tasks.
- A consistency regularizer using stop-gradient anchoring and forward KL divergence effectively prevents performance degradation with minimal computational overhead.
- The method improves context-conditioned accuracy in most settings and reduces context-induced harm in 11 of 12 test configurations.
- Mechanistic analysis shows context removability is achieved at the representation level where hidden states remain nearly identical regardless of context presence.
- The approach eliminates response-length inflation, a common side effect of context-conditioned model outputs.
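The harm and representation measurements mentioned in the takeaways above could be computed along the following lines. This is a hedged sketch: the summary does not define either metric, so it assumes that context-induced harm is the share of tasks solved without context that fail once context is reintroduced, and that hidden-state agreement is measured by cosine similarity. The function names are hypothetical.

```python
import math


def context_induced_harm(correct_without, correct_with):
    """Fraction of tasks solved without context that fail with context.

    ASSUMPTION: this definition of "harm" is illustrative, not the paper's.
    Both arguments are equal-length lists of booleans, one entry per task.
    """
    with_flags = [w for wo, w in zip(correct_without, correct_with) if wo]
    if not with_flags:
        return 0.0
    return sum(1 for w in with_flags if not w) / len(with_flags)


def cosine_similarity(u, v):
    """Cosine similarity between two hidden-state vectors (lists of floats)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


if __name__ == "__main__":
    without_ctx = [True, True, False, True]
    with_ctx = [True, False, True, True]
    # Three tasks were solved without context; one of them breaks with it.
    print(context_induced_harm(without_ctx, with_ctx))
    # Values near 1.0 would indicate nearly identical hidden states.
    print(cosine_similarity([0.2, 0.9, -0.4], [0.21, 0.88, -0.41]))
```

Under these assumptions, a context-removable model would show a harm rate near zero and hidden-state similarity near one between runs with and without the context.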