AI · Neutral · arXiv – CS AI · 9h ago · 6/10
Semiparametric Preference Optimization: Your Language Model is Secretly a Single-Index Model
Researchers present an approach to aligning language models with human preferences that does not assume a specific link function between observed preferences and underlying rewards, such as the logistic link used in Bradley-Terry models. The method frames policy alignment as a semiparametric optimization problem, enabling more robust policy learning even when the structure of the preference model is unknown or misspecified.
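
To make the "single-index" idea in the title concrete, the contrast can be sketched in generic notation. The symbols here are assumptions for illustration and do not come from the summary: r is a reward, x a prompt, a and a' two candidate responses, and Ψ an unknown link function. This is the textbook form of a single-index binary choice model, not necessarily the paper's exact formulation.

```latex
\begin{align*}
  % Standard assumption: known logistic link (Bradley-Terry)
  \Pr(a \succ a' \mid x) &= \sigma\bigl(r(x,a) - r(x,a')\bigr),
      \qquad \sigma(t) = \frac{1}{1 + e^{-t}} \\
  % Semiparametric view: same scalar index, link left unspecified
  \Pr(a \succ a' \mid x) &= \Psi\bigl(r(x,a) - r(x,a')\bigr),
      \qquad \Psi \text{ unknown}
\end{align*}
```

In the second line the preference probability still depends on the responses only through one scalar index, but the shape of Ψ is not fixed in advance, which is what removes the dependence on a correctly specified preference model.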