Same Evidence, Different Answers: Canonical-Context On-Policy Distillation for Multi-Turn Language Models
Researchers propose Canonical-Context On-Policy Distillation (CCOPD), a training method that improves large language models' ability to solve problems when information is revealed incrementally across multiple conversation turns rather than all at once. By using a frozen teacher model with complete context to guide a student model receiving fragmented information, CCOPD achieves 32% relative performance improvement on multi-turn tasks while maintaining single-prompt performance.
The research addresses a fundamental limitation in current LLM behavior: models that excel with complete information often struggle when the same data arrives gradually through conversation. This gap stems from 'self-anchored drift,' where incomplete-information responses introduce unsupported assumptions that compound through subsequent turns and contaminate final answers. CCOPD tackles this through a novel knowledge distillation approach where the same model serves dual roles—one pristine teacher operating on full context, one trainable student learning from fragmented multi-turn inputs. The method forces alignment between student trajectories and teacher behavior, strengthening evidence grounding throughout conversations.
This work emerges from growing recognition that real-world LLM deployment differs fundamentally from academic benchmarks. Users interact conversationally, revealing information gradually, yet models trained primarily on single-prompt tasks show significant performance degradation. The approach builds on established distillation techniques but applies them creatively to the temporal dimension of conversations. Training exclusively on math problems yet achieving 32% improvements across five zero-shot out-of-domain tasks suggests the method captures general principles about consistent reasoning across partial information states.
For AI practitioners and model developers, CCOPD represents a practical training technique addressing deployment realities without requiring architectural changes or massive additional compute. The preservation of full-context performance while improving sharded performance indicates no fundamental trade-off. The framework potentially applies to any multi-turn application—customer support, research assistance, iterative problem-solving—where reasoning quality depends on tracking cumulative evidence across conversation history.
- →CCOPD improves multi-turn performance 32% on average while maintaining single-prompt capability, addressing real-world conversational LLM limitations.
- →The method uses same-model teacher-student distillation to align fragmented information handling with complete-context reasoning patterns.
- →Training on math conversations alone produces generalizable improvements across five zero-shot out-of-domain task families.
- →Self-anchored drift—assumptions from partial information contaminating final answers—emerges as a key explanatory mechanism for multi-turn failures.
- →The technique requires no architectural changes and strengthens grounding in user evidence while reducing assistant-turn contamination sensitivity.