AI Bullish — arXiv · CS AI · 1d ago · 7/10
Self-Distillation for Multi-Token Prediction
Researchers propose MTP-D, a self-distillation method that improves multi-token prediction (MTP) in large language models, reporting 7.5% higher acceptance rates and up to a 220% inference speedup. The technique addresses key challenges in training multiple prediction heads while preserving the main model's performance.
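The post does not spell out MTP-D's exact objective, but the general idea behind self-distillation for extra prediction heads can be sketched: each auxiliary head (predicting the token k steps ahead) is trained against the frozen main model's own soft distribution at that future position, rather than only hard labels. A minimal NumPy sketch of such a distillation loss, with all function names and the temperature parameter `tau` being illustrative assumptions, not the paper's API:

```python
import numpy as np

def softmax(z, axis=-1):
    # numerically stable softmax over the vocabulary axis
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def self_distill_loss(head_logits, teacher_logits, tau=1.0):
    """KL(teacher || head) with soft targets.

    head_logits:    (batch, vocab) logits from an auxiliary MTP head
                    predicting the token k steps ahead.
    teacher_logits: (batch, vocab) logits the frozen main model produced
                    at that same future position (the "self" teacher).
    tau:            softening temperature (illustrative assumption).
    """
    p = softmax(teacher_logits / tau)          # soft targets
    q = softmax(head_logits / tau)             # head's prediction
    log_p = np.log(p + 1e-12)
    log_q = np.log(q + 1e-12)
    # mean KL divergence over the batch; zero iff head matches teacher
    return float((p * (log_p - log_q)).sum(axis=-1).mean())

# Toy usage: two heads, a small vocabulary
rng = np.random.default_rng(0)
teacher = rng.normal(size=(4, 16))
head = rng.normal(size=(4, 16))
loss = self_distill_loss(head, teacher)
zero = self_distill_loss(teacher, teacher)     # perfect match → ~0
```

A mismatched head yields a positive KL loss, while a head that exactly reproduces the teacher's distribution drives the loss to zero; training the auxiliary heads this way is what keeps their drafts consistent with the main model, which is what the acceptance rate in speculative decoding measures.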