PRISM: Pre-alignment via Black-box On-policy Distillation for Multimodal Reinforcement Learning
Researchers introduce PRISM, a three-stage training pipeline that addresses distributional drift in large multimodal models by inserting a distribution-alignment stage between supervised fine-tuning and reinforcement learning. The method uses a Mixture-of-Experts discriminator to correct perception and reasoning errors, achieving 4.4-6.0 percentage point improvements on multimodal benchmarks compared to standard SFT-to-RLVR approaches.
PRISM addresses a fundamental challenge in post-training large multimodal models: the errors that compound when supervised fine-tuning pulls the model away from its original capabilities and supervision distribution. Standard practice applies SFT followed by reinforcement learning with verifiable rewards (RLVR), but this sequential approach lets drift accumulate, which is particularly problematic in multimodal contexts where perception and reasoning failures interact unpredictably.
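The three-stage schedule described above can be sketched as follows. This is a hypothetical illustration only: the function names, the toy `model` dict, and the data sizes in the comments are stand-ins drawn from the article, not the authors' actual API.

```python
# Hypothetical sketch of the PRISM three-stage schedule. The "model" here is a
# toy dict tracking which stages ran; all functions are illustrative stubs.

def supervised_finetune(model, demos):
    # Stage 1: SFT on a large public mixture (the article cites ~1.26M examples).
    return {**model, "stages": model["stages"] + ["sft"],
            "seen": model["seen"] + len(demos)}

def align_distribution(model, demos):
    # Stage 2: black-box on-policy distillation against an MoE discriminator,
    # using a smaller high-fidelity set (~113K demonstrations in the article).
    return {**model, "stages": model["stages"] + ["align"],
            "seen": model["seen"] + len(demos)}

def rl_verifiable_rewards(model, tasks):
    # Stage 3: RLVR with any policy-gradient variant (GRPO, DAPO, GSPO).
    return {**model, "stages": model["stages"] + ["rlvr"]}

def prism_pipeline(model, sft_demos, align_demos, rl_tasks):
    model = supervised_finetune(model, sft_demos)
    model = align_distribution(model, align_demos)
    return rl_verifiable_rewards(model, rl_tasks)

model = prism_pipeline({"stages": [], "seen": 0}, range(5), range(2), range(3))
print(model["stages"])  # ['sft', 'align', 'rlvr']
```

The point of the intermediate stage is simply ordering: alignment runs after SFT has introduced drift but before RL can amplify it.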
The research builds on established principles of on-policy distillation to create an adversarial alignment stage that operates as a black-box response-level game. By deploying separate perception and reasoning experts within a Mixture-of-Experts discriminator, the method provides targeted corrective signals without requiring access to teacher model logits—a significant practical advantage for scaling with closed-source models like Gemini 3 Flash. The team's curation of 113K high-quality demonstrations featuring dense visual grounding and step-by-step reasoning on difficult problems reflects growing recognition that data quality matters as much as quantity in advanced AI training.
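A response-level signal from such a Mixture-of-Experts discriminator could look like the following minimal sketch. The `Judgement` type, the gate weights, and the linear combination are all assumptions for illustration; the article does not specify how the experts are scored or combined.

```python
# Hypothetical sketch of a response-level MoE discriminator signal: one expert
# scores perception (visual grounding), one scores reasoning, and a gate weights
# them. Everything here is a toy placeholder, not the paper's discriminator.

from dataclasses import dataclass

@dataclass
class Judgement:
    perception: float  # in [0, 1]: did the response ground the image correctly?
    reasoning: float   # in [0, 1]: are the derivation steps valid?

def moe_reward(judgement: Judgement, gate=(0.5, 0.5)) -> float:
    # Black-box: operates only on sampled responses, never on teacher logits.
    wp, wr = gate
    return wp * judgement.perception + wr * judgement.reasoning

r = moe_reward(Judgement(perception=0.9, reasoning=0.6), gate=(0.4, 0.6))
print(round(r, 2))  # 0.72
```

Separating the two experts is what lets the corrective signal say *which kind* of failure occurred, rather than emitting a single undifferentiated score.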
For the AI development community, PRISM demonstrates measurable improvements across multiple RL algorithms (GRPO, DAPO, GSPO) and diverse benchmarks, suggesting the approach has genuine robustness. The public release of code, data, and checkpoints lowers the barrier to adoption. The work validates that explicit distribution alignment, rather than end-to-end optimization alone, can meaningfully improve multimodal model quality. As teams scale vision-language models and integrate them into production systems, techniques addressing distributional drift become increasingly valuable for maintaining performance consistency and reducing error compounding in complex reasoning tasks.
- PRISM inserts a distribution-alignment stage between SFT and RLVR to reduce drift in multimodal models, improving accuracy by 4.4-6.0 percentage points.
- The method uses a black-box adversarial game with Mixture-of-Experts discriminators to provide separate corrective signals for perception and reasoning errors.
- Researchers curated 113K high-fidelity demonstrations from Gemini 3 Flash to support distribution alignment beyond initial SFT on 1.26M public examples.
- PRISM demonstrates consistent improvements across multiple RL algorithms, indicating robustness and potential for broader adoption in multimodal model training.
- Public release of code, data, and checkpoints enables rapid community adoption and validation of the distribution-alignment approach.