AI · Bullish · arXiv – CS AI · 6h ago · 7/10
Optimal Transport for LLM Reward Modeling from Noisy Preference
Researchers introduce SelectiveRM, an optimal transport-based framework that improves reward model training for large language models when the preference data is noisy. It combines a joint consistency discrepancy with a partial transport mechanism to automatically filter out contradictory samples. According to the authors, the method theoretically optimizes cleaner risk bounds and outperforms existing methods.