Value Entanglement: Conflation Between Different Kinds of Good In (Some) Large Language Models
Researchers have identified a critical flaw in some large language models (LLMs): moral value inappropriately influences their judgments of grammatical and economic quality. The study reveals that these LLMs conflate different types of value rather than distinguishing them as humans do, a problem that can be partially fixed through targeted ablation of morality-related activation vectors.
This research addresses a fundamental challenge in AI safety: ensuring that language models accurately represent and process distinct domains of value. The findings reveal that current LLMs don't compartmentalize moral, grammatical, and economic evaluations the way humans naturally do. Instead, moral considerations bleed into areas where they shouldn't dominate decision-making, such as assessing writing quality or economic efficiency. This 'value entanglement' occurs at multiple levels of model architecture, including embeddings and residual stream activations, suggesting the problem is deeply embedded in how these systems learn to represent concepts.
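To make the idea of entanglement in activation space concrete, here is a minimal sketch of one way such overlap could be measured. It is not the study's procedure, which this summary does not describe. The sketch assumes each kind of value is summarized by a difference-of-means direction between "good" and "bad" examples, and it uses synthetic vectors in place of real model activations. The names `difference_of_means`, `make_pairs`, and `shared_axis` are hypothetical.

```python
import numpy as np

def difference_of_means(pos, neg):
    """Unit vector pointing from the mean of `neg` to the mean of `pos`."""
    d = pos.mean(axis=0) - neg.mean(axis=0)
    return d / np.linalg.norm(d)

def make_pairs(rng, axis, n=200, noise=0.1):
    """Synthetic 'good' and 'bad' activations separated along `axis`."""
    dim = axis.shape[0]
    good = axis + noise * rng.normal(size=(n, dim))
    bad = -axis + noise * rng.normal(size=(n, dim))
    return good, bad

rng = np.random.default_rng(0)
dim = 64
basis = np.eye(dim)
# Assumption: three orthogonal axes stand in for moral value, grammatical
# value, and a generic "goodness" component shared by both.
moral_axis, grammar_axis, shared_axis = basis[0], basis[1], basis[2]

# Entangled case: both value directions include the shared component.
moral_dir = difference_of_means(*make_pairs(rng, moral_axis + shared_axis))
grammar_dir = difference_of_means(*make_pairs(rng, grammar_axis + shared_axis))
entangled = float(moral_dir @ grammar_dir)  # cosine, since both are unit vectors

# Separated case: each value lives on its own axis.
moral_only = difference_of_means(*make_pairs(rng, moral_axis))
grammar_only = difference_of_means(*make_pairs(rng, grammar_axis))
separated = float(moral_only @ grammar_only)

print(f"entangled cosine: {entangled:.2f}")
print(f"separated cosine: {separated:.2f}")
```

A cosine well above zero between the two directions indicates that the representations share a component, which is the pattern the term "value entanglement" describes. With real models, the synthetic arrays would be replaced by embeddings or residual stream activations collected for contrasting prompts.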
The research emerges amid growing concern about value alignment in increasingly capable AI systems. As LLMs become more influential in decision-making across society, understanding how they internally represent and balance competing values becomes critical. Previous work has focused on whether models have the right values, but this study investigates the structural fidelity of value representation itself. The empirical approach, which probes internal model states such as embeddings and activations rather than relying on behavioral outputs alone, provides concrete evidence of misalignment at the mechanistic level.
For the AI development community, this work suggests that simple alignment techniques may be insufficient without understanding the underlying representational architecture. The ability to repair value entanglement through selective ablation demonstrates that targeted interventions can work, but it also implies that developers must carefully audit internal model states, not just output behaviors. Going forward, this research highlights the need for more sophisticated probing methods to detect subtle misalignments before systems are deployed in high-stakes environments where conflating moral and economic values could produce harmful outcomes.
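As an illustration of what ablating an activation vector can mean in practice, the sketch below projects a single direction out of a batch of activations. This is a common form of directional ablation, shown here under stated assumptions; the summary does not specify how the study performed its ablation. The function `ablate_direction` and the variable `moral_dir` are hypothetical, and the activations are random placeholders rather than outputs of a real model.

```python
import numpy as np

def ablate_direction(acts, direction):
    """Remove the component of each activation row that lies along `direction`.

    acts:      array of shape (n_tokens, d_model)
    direction: array of shape (d_model,), need not be normalized
    """
    unit = direction / np.linalg.norm(direction)
    coeffs = acts @ unit                  # projection coefficient per row
    return acts - np.outer(coeffs, unit)  # subtract the projected component

rng = np.random.default_rng(0)
acts = rng.normal(size=(5, 8))            # placeholder residual stream activations

# Assumption: a morality-related direction has already been identified.
moral_dir = np.zeros(8)
moral_dir[0] = 3.0

ablated = ablate_direction(acts, moral_dir)
unit = moral_dir / np.linalg.norm(moral_dir)

# After ablation, nothing remains along the chosen direction,
# while components orthogonal to it are untouched.
print(np.allclose(ablated @ unit, 0.0))
print(np.allclose(ablated[:, 1:], acts[:, 1:]))
```

Because the projection removes only one direction, information carried along orthogonal directions is preserved. Any value signal that is not captured by the chosen direction also survives, which is one plausible reason an ablation of this kind would yield a partial rather than a complete repair.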
- LLMs conflate moral, grammatical, and economic values rather than maintaining separate representations as humans do
- Moral considerations inappropriately dominate judgments in non-moral domains, creating systematic value misalignment
- The problem is detectable through multiple methods of inspecting model internals, including analysis of embeddings and residual stream activations
- Selective ablation of morality-related activation vectors can partially repair value entanglement in tested scenarios
- Current alignment approaches may miss structural representational issues that only surface under detailed mechanistic inspection