Research comparing 120 pairs of base and aligned language models finds that alignment training makes models more normative but less descriptive of actual human behavior. In multi-round strategic games, base models predict real human choices with a roughly 10:1 advantage over their aligned counterparts, while aligned models excel only in single-shot, textbook scenarios where human behavior follows rational expectations.
This research exposes a critical tension in how language models are optimized post-training. Alignment procedures, designed to make models safer and more useful, reshape their behavioral predictions by pushing them toward idealized, rational decision-making rather than authentic human conduct. The 10:1 predictive advantage of base models in strategic, multi-round settings suggests that alignment degrades the capacity to model real behavioral dynamics such as reciprocity, retaliation, and adaptive learning that emerge through interaction.
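To make the comparison concrete, here is a minimal Python sketch of how one might score a base model and its aligned counterpart against observed human choices. It is an illustration under stated assumptions, not the study's method: the summary does not say how the 10:1 figure was computed, so the scoring rule (mean log-likelihood), the pairwise win count, the function names, and all numbers below are hypothetical.

```python
import math

def mean_log_likelihood(predicted_probs, observed_choices):
    """Average log-probability a model assigned to the choices humans actually made.

    predicted_probs: one dict per decision, mapping each action to a probability.
    observed_choices: the action the human actually took at each decision.
    Higher (closer to zero) means the model describes the behavior better.
    """
    total = 0.0
    for probs, choice in zip(predicted_probs, observed_choices):
        total += math.log(probs[choice])
    return total / len(observed_choices)

def win_counts(score_pairs):
    """Count how often each side of a (base, aligned) pair scores higher.

    score_pairs: list of (base_score, aligned_score) tuples, one per model pair.
    Ties are counted for neither side.
    """
    base_wins = sum(1 for base, aligned in score_pairs if base > aligned)
    aligned_wins = sum(1 for base, aligned in score_pairs if aligned > base)
    return base_wins, aligned_wins

# Hypothetical toy data: four rounds of a repeated game in which the human
# mostly retaliates, while the aligned model keeps expecting cooperation.
observed = ["defect", "defect", "cooperate", "defect"]
base_preds = [
    {"cooperate": 0.4, "defect": 0.6},
    {"cooperate": 0.3, "defect": 0.7},
    {"cooperate": 0.5, "defect": 0.5},
    {"cooperate": 0.4, "defect": 0.6},
]
aligned_preds = [
    {"cooperate": 0.8, "defect": 0.2},
    {"cooperate": 0.9, "defect": 0.1},
    {"cooperate": 0.8, "defect": 0.2},
    {"cooperate": 0.8, "defect": 0.2},
]

base_score = mean_log_likelihood(base_preds, observed)
aligned_score = mean_log_likelihood(aligned_preds, observed)
print(f"base: {base_score:.3f}, aligned: {aligned_score:.3f}")
```

Repeating this scoring for every model pair and passing the resulting scores to `win_counts` would give one possible reading of a base-to-aligned ratio; other metrics, such as accuracy or correlation with human choice frequencies, would work equally well in this sketch.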
The pattern is particularly revealing in its boundary conditions. Aligned models perform comparably to base models in single-shot games and lottery choices where human behavior genuinely aligns with normative game theory. But in bargaining, persuasion, and negotiation—contexts where humans adapt based on history and counterparty behavior—aligned models become poor descriptors. This suggests alignment training implicitly teaches models to ignore messy, adaptive decision-making patterns in favor of rule-based, utility-maximizing behavior.
For the AI industry, this creates a profound trade-off. Models optimized for human preference alignment serve human users well but become worse scientific instruments for understanding actual human behavior. Researchers seeking to model human decision-making must choose between using aligned models that reflect preferred norms or base models that capture authentic behavioral complexity. Organizations building AI systems for economics research, behavioral simulation, or user modeling must now explicitly account for this normative bias. The research suggests that different applications require fundamentally different model choices, challenging the assumption that alignment improvements universally enhance model quality across all use cases.
- Alignment training optimizes for human preferences but reduces how well models predict actual human behavior in strategic interactions, where base models hold a roughly 10:1 advantage
- Base models outperform aligned counterparts in multi-round games where human decisions depend on reciprocity and adaptation rather than rational theory
- Aligned models maintain predictive parity only in single-shot scenarios where human behavior conforms to normative game theory solutions
- The research points to a fundamental tension between optimizing models for safe human use and using them as descriptive proxies for behavior
- Different applications now require deliberate trade-offs between alignment-improved safety and base-model accuracy for behavioral modeling