AI · Bullish · arXiv – CS AI · 7h ago · 7/10
PR2: Predictive Routing Replay for MoE-Based LLM Reinforcement Learning
Researchers propose Predictive Routing Replay (PR2), a technique for stabilizing reinforcement learning (RL) training of Mixture-of-Experts (MoE) LLMs. PR2 predicts how the router will evolve and thereby reduces the mismatch between the rollout and training phases. It targets router drift, a critical source of instability when MoE-based models undergo RL fine-tuning, using lightweight prediction mechanisms that anticipate changes in expert activation.
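To make the rollout/training mismatch concrete, here is a minimal sketch of one way router drift could be quantified: the share of top-k experts that a rollout-time router and a later training-time router agree on for the same tokens. This is not the PR2 method, and the summary does not describe its prediction mechanism. The linear router, the top-k selection rule, and every name below (`router_logits`, `top_k_experts`, `routing_overlap`, the toy sizes) are assumptions for illustration only.

```python
import random


def router_logits(weights, token):
    """Hypothetical linear router: one logit per expert (row . token)."""
    return [sum(w * x for w, x in zip(row, token)) for row in weights]


def top_k_experts(logits, k):
    """Indices of the k experts with the largest logits."""
    order = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)
    return set(order[:k])


def routing_overlap(rollout_w, train_w, tokens, k):
    """Mean fraction of top-k experts shared by the two routers.

    1.0 means both routers pick identical experts for every token;
    lower values mean the routing has drifted apart.
    """
    total = 0.0
    for token in tokens:
        a = top_k_experts(router_logits(rollout_w, token), k)
        b = top_k_experts(router_logits(train_w, token), k)
        total += len(a & b) / k
    return total / len(tokens)


# Toy setup (all sizes are arbitrary assumptions): 8 experts, 16-dim tokens.
rng = random.Random(0)
num_experts, dim, k = 8, 16, 2
rollout_w = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(num_experts)]
# Simulate drift by perturbing the router weights, as a gradient update might.
drifted_w = [[w + rng.gauss(0, 0.5) for w in row] for row in rollout_w]
tokens = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(64)]

same = routing_overlap(rollout_w, rollout_w, tokens, k)
drifted = routing_overlap(rollout_w, drifted_w, tokens, k)
print(f"overlap without drift: {same:.2f}, with drift: {drifted:.2f}")
```

An overlap below 1.0 means some tokens would be sent to different experts in training than they were during rollout, which is the kind of mismatch the summary says PR2 aims to reduce.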