AI · Bullish · arXiv – CS AI · 7h ago · 7/10
When Distance Distracts: Representation Distance Bias in BT-Loss for Reward Models
Researchers identify a critical bias in the Bradley-Terry (BT) loss, the standard objective for training reward models in LLM alignment: gradient magnitudes are distorted by representation distance rather than driven by prediction error. They propose NormBT, a lightweight normalization scheme that refocuses learning on actual ranking mistakes, and report improvements of more than 5% on fine-grained reasoning benchmarks.
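To make the bias concrete, here is a minimal sketch under simplifying assumptions that are not taken from the paper: a linear reward head r(h) = w · h on fixed response representations. The BT loss for a pair is -log σ(r(h_chosen) - r(h_rejected)), so its gradient with respect to w is -(1 - σ(margin)) · (h_chosen - h_rejected). Its norm is the prediction error multiplied by the representation distance. The function name `bt_head_gradient` and the `normalize` option are hypothetical. Dividing by the distance is one plausible way to cancel the effect, and it should not be read as NormBT's exact formulation.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def l2_norm(v):
    return math.sqrt(sum(x * x for x in v))


def bt_head_gradient(w, h_chosen, h_rejected, normalize=False):
    """Gradient of the BT loss w.r.t. a linear reward head w (hypothetical helper).

    Loss: -log(sigmoid(w . h_chosen - w . h_rejected)).
    Returns (gradient, prediction_error).
    """
    diff = [c - r for c, r in zip(h_chosen, h_rejected)]
    margin = sum(wi * di for wi, di in zip(w, diff))
    error = 1.0 - sigmoid(margin)  # large when the pair is ranked badly
    scale = error
    if normalize:
        # Assumption for illustration only: divide out the representation
        # distance so the gradient norm depends on the error alone.
        scale = error / max(l2_norm(diff), 1e-12)
    grad = [-scale * d for d in diff]
    return grad, error


# Two pairs with the same reward margin (same prediction error) but
# very different distances between chosen and rejected representations.
w = [1.0, 0.0]
near_pair = ([1.0, 0.0], [0.0, 0.0])  # distance 1
far_pair = ([1.0, 5.0], [0.0, 0.0])   # distance sqrt(26)

for name, (h_c, h_r) in [("near", near_pair), ("far", far_pair)]:
    grad, error = bt_head_gradient(w, h_c, h_r)
    grad_n, _ = bt_head_gradient(w, h_c, h_r, normalize=True)
    print(name, "error:", error,
          "| grad norm:", l2_norm(grad),
          "| normalized grad norm:", l2_norm(grad_n))
```

Both pairs have the same prediction error, yet the far pair receives a gradient roughly five times larger under plain BT. With the illustrative normalization the two gradient norms coincide. In a full reward model the gradient also flows into the backbone, so this head-only view is a simplification.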