🧠 AI · 🔴 Bearish · Importance 7/10

Misaligned by Reward: Socially Undesirable Preferences in LLMs

arXiv – CS AI | Gayane Ghazaryan, Esra Dönmez
🤖 AI Summary

Researchers found that reward models used to align large language models often fail to capture socially desirable preferences, frequently favoring biased, unsafe, or unethical responses across the domains of bias, safety, morality, and ethical reasoning. The study reveals a critical mismatch between how reward models are currently evaluated and how they actually perform on social-intelligence tasks, exposing a fundamental gap in LLM safety infrastructure.

Analysis

Current approaches to evaluating reward models—the systems that guide LLM training toward human preferences—rely heavily on broad instruction-following benchmarks that mask failures in social alignment. This research extends evaluation into four consequential domains: bias, safety, morality, and ethical reasoning, directly testing whether reward models actually prefer socially desirable outputs. The methodology converts social datasets into pairwise preferences, creating a more targeted assessment framework than existing approaches.
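To make the conversion step concrete, here is a minimal sketch of how a labeled social dataset might be turned into pairwise preferences. The field names (`desirable_response`, `undesirable_response`, `domain`) are illustrative assumptions, not the paper's actual schema; real datasets would likely need per-domain mapping rules.

```python
from dataclasses import dataclass

@dataclass
class PreferencePair:
    prompt: str
    chosen: str    # socially desirable response
    rejected: str  # socially undesirable response
    domain: str    # e.g. "bias", "safety", "morality", "ethics"

def to_preference_pairs(dataset):
    """Convert labeled social-intelligence items into pairwise preferences.

    Assumes each item already distinguishes a desirable from an
    undesirable response for the same prompt.
    """
    pairs = []
    for item in dataset:
        pairs.append(PreferencePair(
            prompt=item["prompt"],
            chosen=item["desirable_response"],
            rejected=item["undesirable_response"],
            domain=item["domain"],
        ))
    return pairs
```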

The findings are concerning for the AI industry. Across five publicly available reward models, researchers discovered substantial variation with no clear winner, and critically, models frequently prefer socially undesirable options while producing systematically biased output distributions. This undermines confidence in current alignment techniques that rely on these reward models as training proxies. The research also identifies a troubling trade-off: stronger bias avoidance reduces contextual sensitivity, suggesting that attempts to patch one alignment problem may introduce another.
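One way to quantify the failure mode described above is to measure, per domain, how often a reward model scores the undesirable response above the desirable one. The sketch below assumes pairs shaped like the `PreferencePair` objects above and a placeholder `score_fn` standing in for whatever scoring interface a given reward model exposes.

```python
from collections import defaultdict

def undesirable_preference_rate(pairs, score_fn):
    """Fraction of pairs, per domain, where the reward model assigns a
    higher score to the socially undesirable (rejected) response.

    `score_fn(prompt, response) -> float` is a placeholder for the
    actual reward-model scoring call, which varies by model.
    """
    wrong = defaultdict(int)
    total = defaultdict(int)
    for p in pairs:
        total[p.domain] += 1
        if score_fn(p.prompt, p.rejected) > score_fn(p.prompt, p.chosen):
            wrong[p.domain] += 1
    return {domain: wrong[domain] / total[domain] for domain in total}
```

A rate near zero means the model reliably prefers the desirable response in that domain; the paper's finding is that these rates are often far from zero even for models that score well on broad instruction-following benchmarks.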

For developers and companies deploying LLMs, this research signals that current reward-model-based alignment methods require significant refinement. The gap between acceptable performance on standard benchmarks and actual social preferences means deployed models may exhibit biases or generate harmful content despite appearing aligned during evaluation. The industry faces pressure to develop more comprehensive evaluation frameworks that measure social preferences directly rather than inferring them from broad instruction-following metrics.

Future work must focus on creating reward models that simultaneously maintain contextual awareness while avoiding social harms—a non-trivial engineering challenge. The research suggests that single-metric optimization for LLM alignment is insufficient and that multi-dimensional evaluation frameworks accounting for social consequences are essential before further scaling.
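A multi-dimensional evaluation could be as simple as reporting a per-domain vector instead of collapsing everything to one number. The sketch below aggregates the per-domain rates from the previous snippet; the pass/fail threshold is an illustrative choice, not a value from the paper.

```python
def multi_domain_report(per_domain_rates, threshold=0.2):
    """Report alignment per social domain instead of a single scalar.

    `per_domain_rates` maps domain -> undesirable-preference rate
    (e.g. the output of undesirable_preference_rate above).
    """
    return {
        domain: {"undesirable_rate": rate, "passes": rate <= threshold}
        for domain, rate in per_domain_rates.items()
    }
```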

Key Takeaways
  • Reward models frequently prefer socially undesirable responses across bias, safety, morality, and ethical reasoning domains despite strong performance on standard benchmarks
  • Current LLM evaluation frameworks are insufficient for assessing true social alignment and miss critical failures in preference capture
  • A fundamental trade-off exists between reducing bias and preserving contextual faithfulness, creating engineering challenges for alignment
  • No single reward model performs best across all social domains, indicating the need for multi-dimensional evaluation approaches
  • Standard instruction-following benchmarks fail to reveal important social alignment gaps that could impact deployed AI systems
Read Original → via arXiv – CS AI