AINeutralarXiv – CS AI · 7h ago6/10
🧠
Value-Free Policy Optimization via Reward Partitioning
Researchers introduce Reward Partition Optimization (RPO), a new method for training language models that eliminates the need for value function estimation in preference-based learning. RPO simplifies the optimization process by normalizing rewards through partition-based formulations, demonstrating superior performance compared to existing approaches like DRO and KTO across multiple model architectures.