AINeutralarXiv โ CS AI ยท 8h ago6/10
๐ง
PRISM: Pre-alignment via Black-box On-policy Distillation for Multimodal Reinforcement Learning
Researchers introduce PRISM, a three-stage training pipeline that addresses distributional drift in large multimodal models by inserting a distribution-alignment stage between supervised fine-tuning and reinforcement learning. The method uses a Mixture-of-Experts discriminator to correct perception and reasoning errors, achieving 4.4-6.0 percentage point improvements on multimodal benchmarks compared to standard SFT-to-RLVR approaches.
๐ง Gemini