FP4 Explore, BF16 Train: Diffusion Reinforcement Learning via Efficient Rollout Scaling
Researchers introduce Sol-RL, a two-stage reinforcement learning framework that combines FP4 quantization for efficient rollout generation with BF16 precision for policy optimization in diffusion models. The approach achieves up to 4.64x training acceleration while maintaining alignment quality, addressing the computational bottleneck of scaling RL-based post-training on large foundation models such as FLUX.1.
Sol-RL represents a meaningful advancement in making large-scale diffusion model training more computationally accessible. The research addresses a genuine constraint: while increasing rollout group sizes in RL-based alignment produces better results, the computational cost becomes prohibitive for most practitioners working with 12B+ parameter models. The framework's core insight—decoupling the exploration phase from optimization—elegantly sidesteps the traditional efficiency-quality tradeoff.
This work builds on the established trend of applying reinforcement learning to post-train generative models toward human preferences, following successes in text-to-image systems. The technical contribution matters because quantization-aware training has historically risked degrading model outputs when naively applied. By using FP4 only for candidate generation and selectively regenerating promising samples in higher precision, Sol-RL preserves training integrity while capturing efficiency gains.
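The two-stage mechanism described above can be sketched as a simple rollout loop. This is a minimal illustration, not the paper's implementation: the function names (`fp4_rollout`, `bf16_regenerate`), the stand-in reward function, and the pool/keep sizes are all hypothetical placeholders for the real quantized sampler, high-precision sampler, and learned reward model.

```python
def fp4_rollout(prompt, n):
    """Stage 1 stand-in: cheap low-precision generation of n candidates.
    A real system would run the FP4-quantized diffusion sampler here."""
    return [f"{prompt}-cand{i}" for i in range(n)]

def reward(sample):
    """Stand-in reward model: any deterministic score works for the sketch."""
    return sum(ord(c) for c in sample)

def bf16_regenerate(prompt, seed_sample):
    """Stage 2 stand-in: re-run a promising candidate in high precision
    so the policy-gradient update sees training-quality samples."""
    return seed_sample + "-hp"

def two_stage_rollout(prompt, pool_size=64, keep=8):
    """Generate a large FP4 candidate pool, keep only the top-scoring
    candidates, and regenerate those in BF16 for optimization.
    Exploration (cheap, wide) is decoupled from training (precise, narrow)."""
    pool = fp4_rollout(prompt, pool_size)
    ranked = sorted(pool, key=reward, reverse=True)
    winners = ranked[:keep]
    return [bf16_regenerate(prompt, w) for w in winners]

samples = two_stage_rollout("sunset", pool_size=16, keep=4)
print(len(samples))
```

The efficiency gain comes from the asymmetry: most compute is spent in the cheap FP4 pass over a wide candidate pool, while the expensive BF16 pass touches only the small fraction of samples that actually enter the gradient update.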
The broader impact affects both researchers and commercial developers. Reducing training costs by up to 4.64x democratizes experimentation with alignment techniques that were previously cost-prohibitive. This could accelerate innovation cycles for smaller teams and reduce environmental impact through lower computational requirements. The validation across multiple architectures (SANA, FLUX.1, SD3.5-L) suggests the approach generalizes well rather than being model-specific.
The technique signals how future large model development may balance performance with practical constraints. As models continue scaling, algorithmic innovations that reduce computational overhead become increasingly valuable. The combination of hardware-level optimizations (NVFP4) with algorithmic design demonstrates the power of co-optimizing systems and methods rather than treating them separately.
- Sol-RL achieves up to 4.64x training acceleration by combining FP4 rollouts with BF16 optimization, maintaining quality while reducing computational overhead.
- Two-stage design generates massive candidate pools in low precision, then selectively regenerates and trains on high-precision samples, decoupling exploration from optimization.
- Framework validated across SANA, FLUX.1, and SD3.5-L diffusion models, demonstrating broad applicability beyond specific architectures.
- Approach makes RL-based alignment training more accessible by reducing computational barriers that previously limited small teams and researchers.
- Synergistic algorithm-hardware design leverages NVFP4 throughput gains while preserving training integrity through selective high-precision optimization.