🧠 AI · 🟢 Bullish · Importance 7/10

Optimal Transport for LLM Reward Modeling from Noisy Preference

arXiv – CS AI | Licheng Pan, Haochen Yang, Haoxuan Li, Yunsheng Lu, Yongqi Tong, Yinuo Wang, Shijian Wang, Zhixuan Chu, Lei Shen, Yuan Lu, Hao Wang
🤖 AI Summary

Researchers introduce SelectiveRM, an optimal transport-based framework that improves reward model training for large language models by handling noisy preference data. The approach uses a joint consistency discrepancy and a partial transport mechanism to automatically filter out contradictory samples; the authors prove it optimizes a tighter upper bound on the unobserved clean risk and report that it outperforms existing methods.

Analysis

This research addresses a critical bottleneck in RLHF systems: the degradation of reward model quality when trained on inherently noisy human preference data. Reward models serve as the evaluative backbone for aligning LLMs with human intent, so their robustness is essential to downstream performance. Current approaches either overfit to the noisy labels or apply overly simplistic denoising that assumes uniform noise patterns, an assumption ill-suited to linguistic preferences, where subjective interpretation varies widely.
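For context, here is a minimal sketch of the standard pairwise (Bradley-Terry) loss most reward models are trained with; function and tensor names are illustrative, not from the paper. It trusts every label equally, which is exactly the overfitting failure mode described above.

```python
import torch
import torch.nn.functional as F

def bradley_terry_loss(reward_model, chosen, rejected):
    """Standard pairwise reward-model loss: -log sigmoid(r_chosen - r_rejected).

    Every preference label is trusted equally, so mislabeled (noisy)
    pairs pull the model in the wrong direction.
    chosen / rejected: batched inputs for the preferred and rejected responses.
    """
    r_chosen = reward_model(chosen)      # (batch,) scalar rewards
    r_rejected = reward_model(rejected)  # (batch,)
    return -F.logsigmoid(r_chosen - r_rejected).mean()
```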

SelectiveRM's innovation lies in reformulating the problem through optimal transport theory, a mathematical framework originally developed for comparing probability distributions. By introducing a Joint Consistency Discrepancy, the framework aligns model predictions with the preference distribution more rigorously. The Mass Relaxation mechanism via partial transport is particularly novel: rather than forcing the model to explain every data point (strict mass conservation), it permits selective exclusion of semantically inconsistent samples, sidestepping the classical tradeoff between fitting the data and robustness to label noise. A rough sketch of the selection idea follows.
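The paper's exact objective is not reproduced in this summary, so the sketch below only illustrates the general idea under stated assumptions: a per-pair consistency cost stands in for the transport cost, and a hard top-k cut stands in for the fraction of mass a partial transport plan would actually move. All names here (selective_pairwise_loss, keep_ratio) are hypothetical.

```python
import torch
import torch.nn.functional as F

def selective_pairwise_loss(reward_model, chosen, rejected, keep_ratio=0.8):
    """Hypothetical sketch of mass-relaxed sample selection.

    Rather than explaining all pairs (strict mass conservation), only
    the keep_ratio fraction of pairs most consistent with their labels
    contributes to the loss; the rest are excluded as likely noise.
    keep_ratio plays the role of the transported mass in a partial
    optimal transport plan.
    """
    margin = reward_model(chosen) - reward_model(rejected)   # (batch,)
    per_pair_cost = -F.logsigmoid(margin)  # high cost = label-inconsistent pair
    k = max(1, int(keep_ratio * per_pair_cost.numel()))
    # Hard top-k surrogate for the soft coupling an OT solver (e.g.
    # Sinkhorn iterations on the relaxed problem) would produce.
    keep_idx = torch.topk(-per_pair_cost, k).indices
    return per_pair_cost[keep_idx].mean()
```

A faithful implementation would solve the partial transport problem for a soft coupling rather than applying this hard cut, but the shape of the computation is the same: score every pair, move only part of the mass, and train on what remains.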

For the AI development ecosystem, this advances the technical foundation of RLHF pipelines, which power increasingly sophisticated model alignment. Improved reward models translate directly into higher-quality model outputs and more efficient training cycles. The theoretical contribution, a proof that the method optimizes tighter upper bounds on the unobserved clean risk, provides principled justification beyond empirical validation; a schematic of what such a bound looks like follows.
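The summary does not state the bound itself; schematically, guarantees of this kind in noisy-label learning take the following form, with all notation below illustrative rather than the paper's.

```latex
% Schematic clean-risk bound (illustrative notation, not the paper's):
%   R_clean(f)    risk under the unobserved clean preference distribution
%   \hat{R}_S(f)  empirical risk on the subset S kept by partial transport
%   \epsilon(S)   slack from the mass excluded as noise
%   C(F, n)       a standard complexity/concentration term
\[
  R_{\mathrm{clean}}(f) \;\le\; \widehat{R}_{S}(f) + \epsilon(S) + C(\mathcal{F}, n)
\]
% "Tighter" then means the right-hand side shrinks as the selected
% subset S better isolates the genuinely clean pairs, relative to
% bounds for training on all (noisy) pairs.
```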

The approach matters beyond academia: as organizations scale LLM deployment, training on high-quality datasets becomes economically critical. Better denoising reduces data collection costs and improves model reliability. This work potentially influences how teams approach preference annotation and reward model engineering in production settings.

Key Takeaways
  • SelectiveRM uses optimal transport theory to handle noisy preference data in reward model training for LLMs.
  • The framework automatically filters contradictory samples rather than forcing models to overfit to all labels.
  • Joint Consistency Discrepancy aligns model predictions with preference distributions more rigorously than existing methods.
  • Theoretical analysis proves the approach optimizes tighter bounds on unobserved clean risk.
  • The framework demonstrates significant improvements across multiple benchmarks compared to state-of-the-art baselines.