AIBearisharXiv – CS AI · 7h ago7/10
🧠
Misaligned by Reward: Socially Undesirable Preferences in LLMs
Researchers found that reward models used to align large language models often fail to capture socially desirable preferences, preferring biased, unsafe, or unethical responses across domains like bias, safety, and morality. The study reveals a critical misalignment between how reward models are currently evaluated and their actual performance on social intelligence tasks, exposing a fundamental gap in LLM safety infrastructure.