When Distance Distracts: Representation Distance Bias in BT-Loss for Reward Models

arXiv – CS AI | Tong Xie, Andrew Bai, Yuanhao Ban, Yunqi Hong, Haoyu Li, Cho-Jui Hsieh
AI Summary

Researchers identify a critical bias in Bradley-Terry loss, the standard objective for training reward models in LLM alignment, where gradient magnitudes are distorted by representation distance rather than prediction error. They propose NormBT, a lightweight normalization scheme that refocuses learning on actual ranking mistakes, demonstrating 5%+ improvements on fine-grained reasoning benchmarks.

Analysis

The paper addresses a fundamental issue in reward model training that has remained largely unexamined despite the objective's widespread use in RLHF pipelines. The magnitude of the Bradley-Terry loss gradient is shown to depend on two factors: prediction error (the intended signal) and representation distance (a spurious bias). As a result, a model can receive weak updates for genuinely wrong predictions when the two responses have similar representations, and inflated updates for trivial distinctions when the representations are far apart. The problem is particularly acute in fine-grained reasoning tasks, where subtle distinctions matter most.
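To make the two factors concrete, here is a minimal sketch under a simplifying assumption that is not taken from the paper: the reward is a linear head, r(x) = w·h(x), on top of a fixed representation h. In that case the gradient of the Bradley-Terry loss with respect to w is -(1 - σ(margin)) · (h_chosen - h_rejected), so its norm is the prediction error times the representation distance. The function name `bt_grad_norm` and the toy vectors are hypothetical.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def bt_grad_norm(w, h_chosen, h_rejected):
    """Gradient norm of L = -log(sigmoid(w.h_chosen - w.h_rejected)) w.r.t. w.

    Assumes a linear reward head (an illustrative simplification).
    Returns (prediction_error, representation_distance, gradient_norm).
    """
    diff = [c - r for c, r in zip(h_chosen, h_rejected)]
    margin = sum(wi * di for wi, di in zip(w, diff))
    pred_error = 1.0 - sigmoid(margin)               # large when the pair is misranked
    distance = math.sqrt(sum(d * d for d in diff))   # ||h_chosen - h_rejected||
    return pred_error, distance, pred_error * distance


w = [1.0, 0.0]

# Misranked pair (negative margin) whose representations are close together.
close = bt_grad_norm(w, h_chosen=[0.0, 0.0], h_rejected=[0.1, 0.0])

# Correctly ranked pair (positive margin) whose representations are far apart.
far = bt_grad_norm(w, h_chosen=[1.0, 10.0], h_rejected=[0.0, 0.0])

print("close, misranked:", close)
print("far, correct:    ", far)
```

In this toy setup the misranked pair has the larger prediction error but the smaller gradient norm, because its representation distance is tiny. That is the distortion the analysis describes.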

This research emerges as the AI community deepens its focus on reward model quality, recognizing that downstream RLHF performance is tightly coupled to how well reward models rank responses. Prior work has focused on data quality and model scaling, but this analysis reveals a flaw in the training objective itself. The representation distance bias is a hidden mechanism that can steer optimization toward easier, less meaningful distinctions.

NormBT's practical impact stems from its simplicity and broad applicability. By adaptively rescaling gradient magnitudes based on representation distance, it removes the spurious weighting while maintaining training stability. The 5%+ gains on RewardBench's Reasoning category, which contains challenging comparative pairs, suggest a meaningful improvement in the model's ability to discriminate between responses. For organizations training reward models, this is a straightforward change that requires minimal computational overhead and no architectural changes.
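The summary does not give NormBT's exact formula, so the following is only a hedged sketch of the general idea of distance-based rescaling, not the authors' method. It assumes a linear reward head and weights each pair's loss by the batch-mean representation distance divided by that pair's own distance, with the weight treated as a constant. The function name `distance_normalized_bt` and the `eps` parameter are hypothetical.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def distance_normalized_bt(w, pairs, eps=1e-8):
    """Per-pair (weighted_loss, weighted_grad_norm) for a linear reward head.

    pairs: list of (h_chosen, h_rejected) representation vectors.
    Illustrative weighting only: weight = mean_distance / pair_distance.
    """
    diffs = [[c - r for c, r in zip(hc, hr)] for hc, hr in pairs]
    dists = [math.sqrt(sum(d * d for d in diff)) for diff in diffs]
    mean_dist = sum(dists) / len(dists)

    results = []
    for diff, dist in zip(diffs, dists):
        margin = sum(wi * di for wi, di in zip(w, diff))
        pred_error = 1.0 - sigmoid(margin)
        # Constant per-pair weight; in an autograd framework it would be
        # detached so that no gradient flows through the distance itself.
        weight = mean_dist / (dist + eps)
        loss = -math.log(sigmoid(margin))
        # Unweighted gradient norm is pred_error * dist, so the weighted
        # norm is approximately pred_error * mean_dist.
        grad_norm = weight * pred_error * dist
        results.append((weight * loss, grad_norm))
    return results


pairs = [
    ([0.0, 0.0], [0.1, 0.0]),    # misranked, representations close together
    ([1.0, 10.0], [0.0, 0.0]),   # correctly ranked, representations far apart
]
for weighted_loss, grad_norm in distance_normalized_bt([1.0, 0.0], pairs):
    print(weighted_loss, grad_norm)
```

Under this weighting the gradient norms of the two pairs differ only through their prediction errors, so the misranked pair now receives the larger update. Dividing by the batch mean keeps the overall gradient scale roughly unchanged, which is one plausible way to preserve training stability.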

Key Takeaways
  • Bradley-Terry loss gradient magnitude depends on both prediction error and representation distance, creating spurious learning signals that misdirect training
  • Small-distance representation pairs receive weak updates even when misranked, while large-distance pairs receive disproportionately strong updates
  • NormBT normalizes gradient updates to focus learning solely on prediction error, eliminating representation-driven bias
  • Reward model performance improves 5%+ on RewardBench's Reasoning category with NormBT, indicating better discrimination of subtle distinctions
  • The fix is a lightweight, drop-in replacement for standard BT loss with negligible computational overhead