When Distance Distracts: Representation Distance Bias in BT-Loss for Reward Models
Researchers identify a bias in the Bradley-Terry (BT) loss, the standard objective for training reward models in LLM alignment: gradient magnitudes depend on the distance between the two responses' representations, not only on prediction error. They propose NormBT, a lightweight normalization scheme that refocuses learning on actual ranking mistakes, and report gains of more than 5% on the Reasoning category of RewardBench.
The paper addresses an issue in reward model training that has remained largely unexamined despite the BT objective's widespread use in RLHF pipelines. The gradient of the Bradley-Terry loss is shown to scale with two factors: prediction error (the intended signal) and representation distance between the chosen and rejected responses (a spurious bias). As a result, a model can receive a weak update for a genuinely wrong prediction when the two representations are similar, and an inflated update for a trivial distinction when the representations are far apart. The problem is most acute in fine-grained reasoning tasks, where the responses being compared differ only subtly.
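To make the two factors concrete, here is a minimal NumPy sketch. It assumes a linear reward head r(x) = w · h(x) on top of a final-layer representation h(x); that setup, and the names `w`, `h_chosen`, `h_rejected`, and `bt_gradient_factors`, are illustrative assumptions rather than details taken from the paper. Under this assumption, the gradient of the BT loss -log σ(r_chosen - r_rejected) with respect to `w` is -(1 - σ(Δr)) · (h_chosen - h_rejected), so its norm is the prediction error multiplied by the representation distance.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def bt_gradient_factors(w, h_chosen, h_rejected):
    """Return (prediction_error, representation_distance, gradient_norm)
    for the BT loss -log(sigmoid(w.h_chosen - w.h_rejected)) w.r.t. w.
    Assumes a linear reward head; this is an illustrative simplification."""
    diff = h_chosen - h_rejected
    error = 1.0 - sigmoid(w @ diff)   # large when the pair is misranked
    grad = -error * diff              # analytic gradient w.r.t. w
    return error, np.linalg.norm(diff), np.linalg.norm(grad)

w = np.array([1.0, 0.0])  # hypothetical head weights

# Misranked pair whose representations are almost identical.
close_pair = bt_gradient_factors(w, np.array([0.0, 0.0]), np.array([0.1, 0.0]))

# Correctly ranked pair whose representations are far apart.
far_pair = bt_gradient_factors(w, np.array([3.0, 4.0]), np.array([2.0, -4.0]))

for name, (err, dist, gnorm) in [("close", close_pair), ("far", far_pair)]:
    print(f"{name}: error={err:.3f} distance={dist:.3f} grad_norm={gnorm:.3f}")
```

In this toy setting the misranked pair has the larger prediction error but the smaller gradient norm, because its representation distance is small. That is the pattern the article describes.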
This research arrives as the AI community sharpens its focus on reward model quality, recognizing that downstream RLHF performance is tightly coupled to how well reward models rank responses. Prior work has focused on data quality and model scaling, but this analysis reveals a vulnerability in the training objective itself. The representation distance bias acts as a hidden weighting that can steer optimization toward easier, less meaningful distinctions.
NormBT's practical impact stems from its simplicity and broad applicability. By adaptively rescaling gradient magnitudes based on representation distance, it removes the spurious weighting while maintaining training stability. The gains of more than 5% on RewardBench's Reasoning category, which contains challenging comparative pairs, suggest a meaningful improvement in the model's ability to discriminate between similar responses. For organizations training reward models, this is a straightforward change that requires minimal computational overhead and no architectural modifications.
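The rescaling idea can be pictured with a small sketch. This is not the paper's NormBT formulation, which this summary does not spell out; it is an assumed simplification in which each pair's update is divided by its representation distance, treated as a constant per-pair weight. The linear reward head, the `eps` stabilizer, and the function name `distance_normalized_head_gradient` are all hypothetical.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def distance_normalized_head_gradient(w, h_chosen, h_rejected, eps=1e-8):
    """Gradient w.r.t. w of a distance-weighted BT loss for one pair.
    Assumed form: weight * -log(sigmoid(w.diff)), with
    weight = 1 / (||diff|| + eps) held constant (no gradient through it).
    Illustrative only; not the published NormBT rule."""
    diff = h_chosen - h_rejected
    error = 1.0 - sigmoid(w @ diff)
    weight = 1.0 / (np.linalg.norm(diff) + eps)
    return -weight * error * diff

w = np.array([1.0, 0.0])  # hypothetical head weights

# Same two toy pairs: a misranked close pair and a well-ranked distant pair.
g_close = distance_normalized_head_gradient(
    w, np.array([0.0, 0.0]), np.array([0.1, 0.0]))
g_far = distance_normalized_head_gradient(
    w, np.array([3.0, 4.0]), np.array([2.0, -4.0]))

print("close grad norm:", np.linalg.norm(g_close))
print("far grad norm:  ", np.linalg.norm(g_far))
```

With the distance factor divided out, the gradient norm for each pair is approximately its prediction error alone, so the misranked pair now receives the larger update.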
- Bradley-Terry loss gradient magnitude depends on both prediction error and representation distance, creating spurious learning signals that misalign training
- Small-distance representation pairs receive weak updates even when misranked, while large-distance pairs receive disproportionately strong updates
- NormBT normalizes gradient updates to focus learning solely on prediction error, eliminating representation-driven bias
- Reward model performance improves 5%+ on fine-grained reasoning benchmarks with NormBT, indicating better discrimination of subtle distinctions
- The fix is a lightweight, drop-in replacement for standard BT loss with negligible computational overhead