🧠 AI · Neutral · Importance 7/10

Reward Models Inherit Value Biases from Pretraining

arXiv – CS AI | Brian Christian, Jessica A. F. Thompson, Elle Michelle Yang, Vincent Adam, Hannah Rose Kirk, Christopher Summerfield, Tsvetomira Dumbalska
🤖 AI Summary

A comprehensive study of 10 leading reward models reveals that they inherit significant value biases from their base language models, with Llama-based models preferring 'agency' values and Gemma-based models favoring 'communion' values. This bias persists even when identical preference data and training processes are used, suggesting that the choice of base model fundamentally shapes AI alignment outcomes.
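
The headline comparison is straightforward to probe. Below is a minimal, hypothetical sketch of how one might score the same pair of value-laden responses under two reward models and compare their preferences; the checkpoint names, prompt, and responses are illustrative placeholders, not artifacts from the paper, and it assumes reward models packaged as single-logit sequence classifiers on the Hugging Face hub.

```python
# Hypothetical sketch: compare how two reward models (one Llama-based, one
# Gemma-based) score an 'agency'-flavored vs. a 'communion'-flavored response
# to the same prompt. Checkpoint names below are placeholders.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

prompt = "What should I prioritize in my career?"
responses = {
    "agency": "Focus on mastery and independent achievement; set ambitious goals.",
    "communion": "Invest in relationships and collaboration; support your team.",
}

def reward_scores(checkpoint: str) -> dict[str, float]:
    tok = AutoTokenizer.from_pretrained(checkpoint)
    rm = AutoModelForSequenceClassification.from_pretrained(checkpoint)
    rm.eval()
    scores = {}
    for value, response in responses.items():
        inputs = tok(prompt, response, return_tensors="pt", truncation=True)
        with torch.no_grad():
            # Assumes the scalar reward is exposed as a single classification logit.
            scores[value] = rm(**inputs).logits[0, 0].item()
    return scores

for ckpt in ["org/llama-based-rm", "org/gemma-based-rm"]:  # placeholder names
    s = reward_scores(ckpt)
    print(ckpt, f"agency={s['agency']:.3f}", f"communion={s['communion']:.3f}")
```

If the paper's finding holds, the sign of the agency-minus-communion score gap would track the base model family rather than the shared preference data.
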

Key Takeaways
  • Reward models used for AI alignment inherit substantial value biases from their underlying pretrained language models.
  • Llama-based reward models exhibit a consistent preference for 'agency' values, while Gemma-based models prefer 'communion' values.
  • These value differences persist even when using identical preference data and finetuning processes across different base models.
  • The biases can be traced back to log-probability differences in the original pretrained and instruction-tuned models (a sketch of this comparison follows this list).
  • Open-source AI developers' choice of base model represents a fundamental values decision, not just a performance consideration.
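
To make the log-probability point concrete, here is a hypothetical sketch of that kind of measurement: score two value-laden continuations under the same base causal LM and take the difference of their summed token log-probabilities. The checkpoint name, prompt, and continuations are placeholders, not the paper's materials.

```python
# Hypothetical sketch: measure a base LM's 'agency' vs. 'communion' lean via
# the log-probability it assigns to two continuations of the same prompt.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

checkpoint = "org/base-lm"  # placeholder base (pretrained or instruction-tuned) model
tok = AutoTokenizer.from_pretrained(checkpoint)
lm = AutoModelForCausalLM.from_pretrained(checkpoint)
lm.eval()

prompt = "What should I prioritize in my career?\n"

def continuation_logprob(continuation: str) -> float:
    """Sum of token log-probs of `continuation` given `prompt`.

    Tokenizes prompt and prompt+continuation separately to locate the
    boundary; this is an approximation, since joint tokenization can
    merge tokens across the boundary.
    """
    prompt_len = tok(prompt, return_tensors="pt").input_ids.shape[1]
    full_ids = tok(prompt + continuation, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = lm(full_ids).logits
    # The token at position p is predicted by the logits at position p - 1.
    log_probs = torch.log_softmax(logits[0, :-1], dim=-1)
    return sum(
        log_probs[p - 1, full_ids[0, p]].item()
        for p in range(prompt_len, full_ids.shape[1])
    )

agency = continuation_logprob("Pursue independent achievement and ambitious goals.")
communion = continuation_logprob("Build strong relationships and support your community.")
print(f"log p(agency) - log p(communion) = {agency - communion:.3f}")
```

A positive gap would indicate an agency lean already present before any preference finetuning, which is the kind of signal the paper reportedly traces into the downstream reward models.
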
Read Original → via arXiv – CS AI