AIBullisharXiv – CS AI · 9h ago7/10
🧠
Rubric-based On-policy Distillation
Researchers introduce ROPD, a rubric-based on-policy distillation framework that replaces teacher logits with structured semantic rubrics for model alignment. The approach achieves up to 10x better sample efficiency than logit-based methods while enabling distillation from proprietary black-box LLMs, addressing a critical scalability limitation in current model training.