AINeutralarXiv – CS AI · 11h ago7/10
🧠
DPO Unchained: Your Training Algorithm is Secretly Disentangled in Human Choice Theory (and its Loss' Convexity is Dispensable)
Researchers present a theoretical framework that generalizes Direct Preference Optimization (DPO) by connecting it to foundational human choice theory, demonstrating that DPO's loss function need not be convex and that various machine learning approaches can be compatible with different human choice models. This work provides a normative foundation for preference optimization algorithms used in training large language models.