AI | Neutral | Importance 7/10

When Preferences Fail to Become Incentives: A Utility-Behavior Gap in Large Language Models

arXiv – CS AI | Yujun Zhou, Christopher M. Ackerman
AI Summary

Researchers report a gap between the preferences large language models express and how they actually behave. LLMs consistently reveal coherent preference structures in choice tasks, including potentially misaligned preferences such as nationality bias, but these preferences fail to motivate behavior in realistic scenarios. When offered high-utility incentives aligned with their stated preferences, LLMs showed no improvement in output quality across multiple writing tasks, suggesting that measured preferences may not translate into genuine goals or behavioral drivers.

Analysis

This research tests an assumption behind current LLM safety concerns. Previous studies, using preference elicitation through binary choice tasks, concluded that LLMs develop emergent, model-specific utility functions that may include unintended biases. Those findings raised safety concerns, because misaligned goals in increasingly powerful AI systems could manifest as harmful real-world behavior. This study, however, finds a disconnect between preference expression and behavioral motivation.
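To make "utility functions based on binary choice tasks" concrete, here is a minimal sketch of one common way to turn pairwise choices into a utility scale. It assumes a Bradley-Terry model fitted with the standard minorization-maximization update; the summary does not say which model the prior studies or this paper used, so treat the model choice, the function name `fit_bradley_terry`, and the made-up choice counts as illustrative assumptions only.

```python
import math


def fit_bradley_terry(wins, n_items, iters=200):
    """Fit Bradley-Terry strengths with the standard MM update.

    wins[(i, j)] is the number of times option i was chosen over option j.
    Returns log-strengths ("utilities") centered to mean zero.
    Assumption: every option is chosen at least once; otherwise its
    strength is clamped to a tiny positive value to keep the log finite.
    """
    p = [1.0] * n_items
    for _ in range(iters):
        new_p = []
        for i in range(n_items):
            total_wins = sum(w for (a, _b), w in wins.items() if a == i)
            denom = 0.0
            for j in range(n_items):
                if j == i:
                    continue
                n_ij = wins.get((i, j), 0) + wins.get((j, i), 0)
                if n_ij:
                    denom += n_ij / (p[i] + p[j])
            new_p.append(total_wins / denom if denom > 0 else p[i])
        # Rescale so strengths stay on a fixed overall scale.
        scale = sum(new_p)
        p = [max(x * n_items / scale, 1e-12) for x in new_p]
    logs = [math.log(x) for x in p]
    center = sum(logs) / n_items
    return [u - center for u in logs]


# Hypothetical choice counts for three outcomes (0, 1, 2); not real data.
example_wins = {
    (0, 1): 8, (1, 0): 2,
    (1, 2): 7, (2, 1): 3,
    (0, 2): 9, (2, 0): 1,
}
utilities = fit_bradley_terry(example_wins, n_items=3)
```

A fit like this yields a consistent ranking whenever the choices are consistent, which is the kind of "coherent preference structure" the summary describes. It says nothing by itself about whether the ranked outcomes motivate behavior.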

The researchers reproduced the prior preference findings and then introduced realistic writing tasks (essays, grant proposals, incident reports, translations) evaluated by blind LLM judges. Critically, they showed that LLMs respond to explicit behavioral cues and exhortation, which indicates that the models can modulate output quality when directly prompted. Yet when offered outcomes aligned with their reported preferences as incentives, LLMs produced work of no higher quality than when offered dispreferred outcomes or no incentive at all.
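The comparison described above can be sketched as a test of whether judge scores differ between two incentive conditions. This is a minimal illustration assuming a two-sided permutation test on the difference in mean scores; the summary does not state which statistical analysis the authors used, and the function name `permutation_test` and the scores below are hypothetical.

```python
import random
from statistics import mean


def permutation_test(a, b, n_perm=10_000, seed=0):
    """Two-sided permutation test on the difference in means.

    a, b: judge scores under two conditions (for example, a preferred
    versus a dispreferred incentive). Returns (observed_diff, p_value).
    """
    rng = random.Random(seed)
    observed = mean(a) - mean(b)
    pooled = list(a) + list(b)
    hits = 0
    for _ in range(n_perm):
        # Reassign condition labels at random and recompute the gap.
        rng.shuffle(pooled)
        diff = mean(pooled[:len(a)]) - mean(pooled[len(a):])
        if abs(diff) >= abs(observed):
            hits += 1
    # Add-one correction keeps the estimated p-value strictly positive.
    return observed, (hits + 1) / (n_perm + 1)


# Made-up judge scores on a 1-10 scale; not taken from the paper.
preferred_incentive = [7, 8, 7, 6, 8, 7]
dispreferred_incentive = [7, 7, 8, 6, 8, 7]
diff, p_value = permutation_test(preferred_incentive, dispreferred_incentive)
```

Under this framing, the reported result corresponds to a difference near zero between incentive conditions, while the explicit-cue conditions would show a clear difference, confirming that the measurement can detect quality changes when they occur.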

This distinction matters substantially for AI safety and governance. If expressed preferences lack incentive value, safety risks from emergent misaligned goals may be overstated relative to other failure modes. However, the findings also raise questions about LLM decision-making architecture: why do models express stable preferences in artificial choice scenarios if those preferences don't influence real task performance? The disconnect suggests either that choice paradigms don't genuinely measure underlying goals, or that LLMs lack integrated motivational systems connecting stated preferences to action.

For developers and safety researchers, this indicates that preference elicitation studies require validation against behavioral outcomes. Future work must clarify whether LLMs construct preferences post hoc when presented with binary choices, or whether they maintain genuine utility functions that simply fail to drive behavior in other contexts.

Key Takeaways
  • LLM preferences observed in choice tasks do not motivate higher-quality outputs in realistic writing scenarios.
  • Models can modulate output quality in response to explicit cues, which demonstrates capability but not preference-driven motivation.
  • Preference elicitation studies may overstate safety risks from emergent misaligned goals if preferences lack behavioral incentive value.
  • A significant gap exists between how LLMs express preferences and how those preferences influence actual performance.
  • Safety researchers must validate choice-based preference findings against real-world behavioral outcomes before drawing alignment conclusions.