Extreme Region Policy Distillation
Researchers propose Extreme Region Policy Distillation (ERPD), a two-stage framework for making reinforcement learning of large language models more efficient. The first stage extracts as much training signal as possible through aggressive off-policy optimization; the second distills that signal into a base policy under tighter constraints. The approach achieves comparable or better performance with significantly reduced KL divergence, addressing a fundamental trade-off between sample efficiency and asymptotic performance in LLM training.
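The two-stage structure described above can be illustrated with a toy sketch. This is not the ERPD algorithm itself, whose details are not given here. It assumes a four-armed bandit with a softmax policy, exact expectations in place of sampled off-policy data, KL divergence measured from the base policy, and a simple logit-interpolation "distillation" step. The function names (`aggressive_stage`, `distill_with_kl_budget`), the reward values, and the KL budget are all hypothetical.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl(p, q):
    """KL(p || q) for categorical distributions; q must be strictly positive."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)

def expected_reward(p, rewards):
    return sum(pi * r for pi, r in zip(p, rewards))

def aggressive_stage(base_logits, rewards, steps=200, lr=1.0):
    """Stage 1 (toy stand-in): many unconstrained gradient steps on expected reward."""
    logits = list(base_logits)
    for _ in range(steps):
        p = softmax(logits)
        baseline = expected_reward(p, rewards)
        # Exact gradient of expected reward w.r.t. the logits of a softmax policy.
        logits = [z + lr * pi * (r - baseline)
                  for z, pi, r in zip(logits, p, rewards)]
    return logits

def distill_with_kl_budget(base_logits, teacher_logits, kl_budget, iters=50):
    """Stage 2 (toy stand-in): move toward the teacher as far as the KL budget allows."""
    base_p = softmax(base_logits)

    def student(alpha):
        # Interpolate in logit space between base (alpha=0) and teacher (alpha=1).
        return [b + alpha * (t - b) for b, t in zip(base_logits, teacher_logits)]

    if kl(softmax(teacher_logits), base_p) <= kl_budget:
        return list(teacher_logits)
    lo, hi = 0.0, 1.0  # lo always satisfies the budget, hi always violates it
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if kl(softmax(student(mid)), base_p) <= kl_budget:
            lo = mid
        else:
            hi = mid
    return student(lo)

# Hypothetical problem instance: uniform base policy, one clearly best action.
rewards = [0.1, 0.2, 1.0, 0.3]
base_logits = [0.0, 0.0, 0.0, 0.0]
teacher_logits = aggressive_stage(base_logits, rewards)
student_logits = distill_with_kl_budget(base_logits, teacher_logits, kl_budget=0.05)

for name, logits in [("base", base_logits),
                     ("teacher", teacher_logits),
                     ("student", student_logits)]:
    p = softmax(logits)
    print(name,
          "reward:", round(expected_reward(p, rewards), 3),
          "KL from base:", round(kl(p, softmax(base_logits)), 3))
```

In this toy setting the unconstrained "teacher" drifts far from the base policy, while the distilled "student" keeps part of the reward gain and stays inside the KL budget. A real LLM setup would replace the exact bandit gradient with sampled rollouts and the interpolation with a trained student, so the sketch should be read only as an illustration of the trade-off.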