🧠 AI · Neutral · Importance 7/10

Reward Models Inherit Value Biases from Pretraining

arXiv – CS AI | Brian Christian, Jessica A. F. Thompson, Elle Michelle Yang, Vincent Adam, Hannah Rose Kirk, Christopher Summerfield, Tsvetomira Dumbalska
🤖 AI Summary

A comprehensive study of 10 leading reward models reveals that they inherit significant value biases from their base language models, with Llama-based models preferring 'agency' values and Gemma-based models favoring 'communion' values. This bias persists even when identical preference data and training processes are used, suggesting that the choice of base model fundamentally shapes AI alignment outcomes.
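
The headline comparison is straightforward to probe. Below is a minimal, hypothetical sketch of how one might score the same pair of value-laden responses under two reward models and compare their preferences; the checkpoint names, prompt, and responses are illustrative placeholders, not artifacts from the paper, and it assumes reward models packaged as single-logit sequence classifiers on the Hugging Face hub.

```python
# Hypothetical sketch: compare how two reward models (one Llama-based, one
# Gemma-based) score an 'agency'-flavored vs. a 'communion'-flavored response
# to the same prompt. Checkpoint names below are placeholders.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

prompt = "What should I prioritize in my career?"
responses = {
    "agency": "Focus on mastery and independent achievement; set ambitious goals.",
    "communion": "Invest in relationships and collaboration; support your team.",
}

def reward_scores(checkpoint: str) -> dict[str, float]:
    tok = AutoTokenizer.from_pretrained(checkpoint)
    rm = AutoModelForSequenceClassification.from_pretrained(checkpoint)
    rm.eval()
    scores = {}
    for value, response in responses.items():
        inputs = tok(prompt, response, return_tensors="pt", truncation=True)
        with torch.no_grad():
            # Assumes the scalar reward is exposed as a single classification logit.
            scores[value] = rm(**inputs).logits[0, 0].item()
    return scores

for ckpt in ["org/llama-based-rm", "org/gemma-based-rm"]:  # placeholder names
    s = reward_scores(ckpt)
    print(ckpt, f"agency={s['agency']:.3f}", f"communion={s['communion']:.3f}")
```

If the paper's finding holds, the sign of the agency-minus-communion score gap would track the base model family rather than the shared preference data.
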

Key Takeaways
  • Reward models used for AI alignment inherit substantial value biases from their underlying pretrained language models.
  • Llama-based reward models exhibit a consistent preference for 'agency' values, while Gemma-based models prefer 'communion' values.
  • These value differences persist even when using identical preference data and finetuning processes across different base models.
  • The biases can be traced back to log-probability differences in the original pretrained and instruction-tuned models (a sketch of this comparison follows this list).
  • Open-source AI developers' choice of base model represents a fundamental values decision, not just a performance consideration.
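
To make the log-probability point concrete, here is a hypothetical sketch of that kind of measurement: score two value-laden continuations under the same base causal LM and take the difference of their summed token log-probabilities. The checkpoint name, prompt, and continuations are placeholders, not the paper's materials.

```python
# Hypothetical sketch: measure a base LM's 'agency' vs. 'communion' lean via
# the log-probability it assigns to two continuations of the same prompt.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

checkpoint = "org/base-lm"  # placeholder base (pretrained or instruction-tuned) model
tok = AutoTokenizer.from_pretrained(checkpoint)
lm = AutoModelForCausalLM.from_pretrained(checkpoint)
lm.eval()

prompt = "What should I prioritize in my career?\n"

def continuation_logprob(continuation: str) -> float:
    """Sum of token log-probs of `continuation` given `prompt`.

    Tokenizes prompt and prompt+continuation separately to locate the
    boundary; this is an approximation, since joint tokenization can
    merge tokens across the boundary.
    """
    prompt_len = tok(prompt, return_tensors="pt").input_ids.shape[1]
    full_ids = tok(prompt + continuation, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = lm(full_ids).logits
    # The token at position p is predicted by the logits at position p - 1.
    log_probs = torch.log_softmax(logits[0, :-1], dim=-1)
    return sum(
        log_probs[p - 1, full_ids[0, p]].item()
        for p in range(prompt_len, full_ids.shape[1])
    )

agency = continuation_logprob("Pursue independent achievement and ambitious goals.")
communion = continuation_logprob("Build strong relationships and support your community.")
print(f"log p(agency) - log p(communion) = {agency - communion:.3f}")
```

A positive gap would indicate an agency lean already present before any preference finetuning, which is the kind of signal the paper reportedly traces into the downstream reward models.
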
Read Original → via arXiv – CS AI