In LLM Reasoning, there is Irrationality on top of Value Misalignment
Researchers identify 'rational value risk' in large language models, showing that even well-aligned LLMs fail to consistently maximize their intended values during reasoning tasks. The study across major models (Llama, GPT, DeepSeek) reveals that value alignment training alone cannot eliminate this reasoning gap, with performance highly dependent on inference-time strategies.
This research addresses a critical gap between LLM training objectives and actual reasoning behavior. While considerable effort has been invested in aligning language models with human values during training, this study demonstrates that alignment doesn't guarantee optimal decision-making when models engage in complex reasoning. The concept of rational value risk quantifies how much utility a model loses by not pursuing the mathematically optimal reasoning strategy for its stated values. Researchers decomposed this gap into three sources: limited candidate responses, insufficient prompt variations, and imperfect verification mechanisms.
The findings carry substantial implications for AI reliability and deployment. Across multiple model families and benchmarks, the research confirms that rational value risk is systematic rather than anomalous. Notably, value alignment efforts during training reduce but cannot eliminate this risk, suggesting the problem operates at a different level than traditional alignment concerns. The strong sensitivity to inference-time reasoning strategies indicates that how models are prompted and guided during deployment significantly impacts their ability to achieve their trained objectives.
For AI developers and enterprise users, this highlights that post-deployment optimization remains critical even after rigorous alignment training. The observation that longer reasoning improves rationality with diminishing returns suggests scaling reasoning computation has practical limits. This research points toward the need for runtime value verification and adaptive reasoning strategies rather than relying solely on training-time alignment. Future work likely focuses on developing better verification mechanisms and reasoning frameworks that can more consistently bridge the gap between aligned values and rational execution.
- →Even well-aligned LLMs exhibit systematic rational value risk—failing to optimize their stated values during reasoning tasks.
- →Value alignment training reduces but cannot eliminate the gap between deployed and optimal reasoning strategies.
- →Inference-time reasoning strategy selection strongly influences whether models achieve their trained objectives.
- →The rational value risk stems from finite candidate responses, limited prompt sampling, and imperfect verification mechanisms.
- →Longer reasoning improves model rationality with diminishing returns, suggesting practical limits to scaling reasoning compute.