AINeutralarXiv – CS AI · 8h ago6/10
🧠
Beyond Pairs: Your Language Model is Secretly Optimizing a Preference Graph
Researchers introduce Graph Direct Preference Optimization (GraphDPO), an advancement over standard DPO that leverages full preference structures from multiple rollouts per prompt rather than collapsing data into independent pairs. The method maintains computational efficiency while improving stability and performance on reasoning and program synthesis tasks by enforcing transitivity and reducing conflicting supervision signals.