Diff-Instruct with Diffused Reward: Towards Principled One-step Generator RL
Researchers introduce DIDR (Diff-Instruct with Diffused Reward), a reinforcement learning framework that improves one-step text-to-image generation by aligning reward optimization with diffusion dynamics. The method targets a mismatch in existing approaches, where optimizing image-space rewards often degrades overall image fidelity, and it reports better results than current SDXL baselines.
One-step image generators are far cheaper to sample from than multi-step diffusion models, but existing reinforcement learning methods for them face a basic tension: a reward defined on final images can conflict with the noise-to-image trajectory of the underlying diffusion process. Previous approaches could exploit stochastic degrees of freedom that the reward does not constrain, reaching higher numerical rewards while producing lower-quality images.
DIDR solves this by propagating reward-optimal distributions across all noise levels in the diffusion trajectory rather than optimizing only at the final step. Derived from integral KL minimization, this principled approach mathematically ensures alignment between the reward signal and generative dynamics. The introduction of Diffused Reward Score (DRS) as a learned correction to the reference score function, combined with the computationally efficient Diffused Reward Proxy (DRP), makes the approach practical at scale.
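To make the idea of a reward-dependent correction to the reference score concrete, here is a minimal 1D sketch. It is not the paper's implementation: it assumes the reward-optimal distribution is the standard KL-regularized tilt p*(x0) ∝ p_ref(x0) exp(r(x0)/beta), uses a Gaussian reference and a quadratic reward so everything has a closed form, and all function names and parameter values are hypothetical. Under those assumptions, the score of the diffused tilted distribution equals the diffused reference score plus the gradient of log E[exp(r(x0)/beta)] taken over the reference posterior p_ref(x0 | x_t). The code checks that identity numerically and compares it with the image-space reward gradient evaluated directly at a noisy sample.

```python
import numpy as np

# Toy setup (all values and names are illustrative assumptions, not from the paper):
#   reference prior      p_ref(x0) = N(0, 1)
#   reward               r(x0) = -(x0 - a)^2 / 2
#   reward-optimal tilt  p*(x0) ∝ p_ref(x0) * exp(r(x0) / beta)
#   forward noising      x_t = x0 + sigma * eps,  eps ~ N(0, 1)


def closed_form_correction(x_t, sigma, a, beta):
    """Score of the diffused tilted distribution minus the diffused reference score."""
    m = a / (beta + 1.0)       # mean of p*(x0)
    v = beta / (beta + 1.0)    # variance of p*(x0)
    score_tilted = -(x_t - m) / (v + sigma**2)  # p*_t = N(m, v + sigma^2)
    score_ref = -x_t / (1.0 + sigma**2)         # p_ref,t = N(0, 1 + sigma^2)
    return score_tilted - score_ref


def log_expected_exp_reward(x_t, sigma, a, beta, n_grid=20001):
    """log E[exp(r(x0)/beta)] under the reference posterior p_ref(x0 | x_t), by quadrature."""
    post_mean = x_t / (1.0 + sigma**2)
    post_std = sigma / np.sqrt(1.0 + sigma**2)
    x0 = np.linspace(post_mean - 10 * post_std, post_mean + 10 * post_std, n_grid)
    post_pdf = np.exp(-0.5 * ((x0 - post_mean) / post_std) ** 2) / (post_std * np.sqrt(2 * np.pi))
    f = post_pdf * np.exp(-((x0 - a) ** 2) / (2.0 * beta))
    integral = np.sum(0.5 * (f[1:] + f[:-1]) * np.diff(x0))  # trapezoid rule
    return np.log(integral)


def numerical_correction(x_t, sigma, a, beta, h=1e-4):
    """Central finite difference of the log-expectation with respect to x_t."""
    up = log_expected_exp_reward(x_t + h, sigma, a, beta)
    down = log_expected_exp_reward(x_t - h, sigma, a, beta)
    return (up - down) / (2.0 * h)


def naive_reward_gradient(x_t, a, beta):
    """Image-space reward gradient evaluated directly at the noisy sample."""
    return -(x_t - a) / beta


x_t, sigma, a, beta = 1.3, 0.8, 2.0, 0.5
closed = closed_form_correction(x_t, sigma, a, beta)
numeric = numerical_correction(x_t, sigma, a, beta)
naive = naive_reward_gradient(x_t, a, beta)
print(f"closed-form: {closed:.4f}  numerical: {numeric:.4f}  naive: {naive:.4f}")
```

In this toy, the closed-form and numerical corrections agree, while the naive gradient differs from both at a nonzero noise level and only coincides with them as sigma approaches zero. That gap is one way to read the mismatch between image-space rewards and the diffusion trajectory. In a real system the posterior expectation is intractable, which is presumably why the correction has to be learned or approximated.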
According to the reported results, DIDR consistently outperforms existing one-step baselines and, when applied to larger architectures such as the 6B DiT backbone, surpasses its 50-step teacher model. This indicates that the framework does not merely match multi-step performance; it may raise the ceiling for what single-step generation can achieve. For the AI development community, the takeaway is that aligning optimization with the generative mechanics in a principled way can yield better results than brute-force reward maximization.
The research could serve as a template for future reinforcement learning in diffusion models and may influence how researchers approach alignment problems in generative systems. The methodology's transferability across architectures suggests it may apply beyond image generation.
- DIDR eliminates the reward-fidelity tradeoff by aligning RL objectives with diffusion trajectory dynamics across all noise levels
- Framework surpasses 50-step teachers using single-step generation, indicating substantial efficiency gains without quality loss
- Data-free approach based on integral KL minimization provides principled foundation for reward propagation
- Diffused Reward Proxy enables practical implementation through efficient differentiable short-step denoising
- Results Pareto-dominate existing SDXL baselines across multiple evaluation metrics
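For readers unfamiliar with the integral KL objective named in the takeaways, it can be written as follows in the form used by the original Diff-Instruct formulation. The notation is an assumption made for illustration: q_{\theta,t} is the noised marginal of the one-step generator g_\theta, p_t is the noised target distribution, w(t) is a positive weighting, and s denotes a score function. This summary does not give DIDR's exact objective; presumably the target p_t there is the diffused reward-optimal distribution.

```latex
% Integral KL divergence between generator and target, accumulated over noise levels
\mathcal{D}_{\mathrm{IKL}}(q_\theta \,\|\, p)
  = \int_0^T w(t)\,
    \mathbb{E}_{x_t \sim q_{\theta,t}}
    \left[ \log \frac{q_{\theta,t}(x_t)}{p_t(x_t)} \right] \mathrm{d}t

% Its gradient depends only on the two score functions, so no training images are needed
\nabla_\theta \mathcal{D}_{\mathrm{IKL}}
  = \int_0^T w(t)\,
    \mathbb{E}_{z,\; x_0 = g_\theta(z),\; x_t \mid x_0}
    \left[ \big( s_{q_{\theta,t}}(x_t) - s_{p_t}(x_t) \big)^{\top}
           \frac{\partial x_t}{\partial \theta} \right] \mathrm{d}t
```

Because the gradient is expressed through score differences on samples drawn from the generator itself, an objective of this form is consistent with the "data-free" description above.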