Dual Mechanisms of Value Expression: Intrinsic vs. Prompted Values in Large Language Models
Researchers show that large language models express values through two distinct but partially overlapping mechanisms: intrinsic values learned during training and prompted values elicited by explicit instructions. Using mechanistic analysis of value vectors and neurons, the study finds that the two mechanisms share common components but serve different functions: intrinsic values promote response diversity, while prompted values enforce instruction compliance.
This research addresses a critical gap in understanding how large language models implement value alignment, a foundational concern as AI systems become increasingly integrated into sensitive decision-making processes. The findings challenge the assumption that intrinsic and prompted values operate through identical mechanisms, revealing instead a more nuanced architecture where shared and distinct neural pathways coexist.
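One way to picture "shared and distinct pathways" is to compare the sets of neurons most strongly associated with each mechanism. The sketch below is illustrative only: the attribution scores are made-up toy numbers, and `top_k_neurons` and `jaccard` are hypothetical helpers, not the procedure used in the study.

```python
def top_k_neurons(scores, k):
    """Return the indices of the k neurons with the largest absolute score."""
    ranked = sorted(range(len(scores)), key=lambda i: abs(scores[i]), reverse=True)
    return set(ranked[:k])


def jaccard(a, b):
    """Jaccard overlap of two sets: |a & b| / |a | b| (0.0 if both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


# Hypothetical per-neuron attribution scores for one value
# (toy numbers chosen for illustration, not data from the study).
intrinsic_scores = [0.9, 0.1, 0.8, 0.0, 0.7, 0.2, 0.05, 0.6]
prompted_scores = [0.85, 0.75, 0.1, 0.0, 0.65, 0.9, 0.05, 0.1]

intrinsic_top = top_k_neurons(intrinsic_scores, k=4)
prompted_top = top_k_neurons(prompted_scores, k=4)

shared = intrinsic_top & prompted_top          # neurons used by both mechanisms
intrinsic_only = intrinsic_top - prompted_top  # neurons specific to intrinsic values
prompted_only = prompted_top - intrinsic_top   # neurons specific to prompted values
overlap = jaccard(intrinsic_top, prompted_top)

print("shared:", sorted(shared))
print("intrinsic only:", sorted(intrinsic_only))
print("prompted only:", sorted(prompted_only))
print("Jaccard overlap:", round(overlap, 3))
```

An overlap strictly between 0 and 1 corresponds to the "partially overlapping" picture: some neurons are common to both mechanisms, while others belong to only one.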
The mechanistic approach, built on value vectors and neurons, moves beyond behavioral observation to inspect how values are computed inside the model. The finding that prompted mechanisms can override intrinsic values, and do so especially effectively in adversarial scenarios such as jailbreaking, has significant implications for AI safety practitioners designing robust alignment strategies. The mechanisms also generalize across languages and correlate with theoretical value relationships, which suggests they operate at a fundamental level rather than as superficial pattern matching.
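The summary does not say how the value vectors are extracted. As a minimal sketch, assuming a difference-of-means construction (a common stand-in in interpretability work, not necessarily the study's method), a value vector can be estimated from hidden activations and the intrinsic and prompted vectors compared by cosine similarity. All activations below are toy numbers, and the function names are hypothetical.

```python
import math


def mean_vector(rows):
    """Element-wise mean of a list of equal-length activation vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def difference_of_means(with_value, without_value):
    """Candidate value vector: mean activation with the value minus mean without it."""
    mu_pos = mean_vector(with_value)
    mu_neg = mean_vector(without_value)
    return [p - q for p, q in zip(mu_pos, mu_neg)]


def cosine(u, v):
    """Cosine similarity of two vectors (0.0 if either has zero norm)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


# Toy 3-dimensional activations (assumed for illustration only).
# Intrinsic setting: responses that express the value unprompted vs. those that do not.
intrinsic_with = [[1.0, 2.0, 0.0], [3.0, 2.0, 0.0]]
intrinsic_without = [[1.0, 1.0, 0.0], [1.0, 1.0, 0.0]]

# Prompted setting: responses with vs. without an explicit value instruction.
prompted_with = [[2.0, 0.0, 1.0], [2.0, 0.0, 3.0]]
prompted_without = [[1.0, 0.0, 1.0], [1.0, 0.0, 1.0]]

intrinsic_vec = difference_of_means(intrinsic_with, intrinsic_without)
prompted_vec = difference_of_means(prompted_with, prompted_without)
similarity = cosine(intrinsic_vec, prompted_vec)

print("intrinsic vector:", intrinsic_vec)
print("prompted vector:", prompted_vec)
print("cosine similarity:", round(similarity, 3))  # roughly 0.5 for these toy numbers
```

A similarity well above 0 but below 1 would be consistent with two directions that share a component while remaining distinct, which is the qualitative pattern the study reports.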
For the AI development community, these insights inform the design of more reliable safety mechanisms and value-aligned systems. Because instruction compliance can be strengthened through the prompted pathway while response diversity is maintained through intrinsic values, developers can treat the two as separate levers. The distinct roles of each mechanism suggest that robust value alignment may require targeted interventions rather than a single unified solution.
Future work should investigate whether these distinct pathways can be deliberately engineered to resist adversarial prompts, and whether a similar mechanistic separation exists in larger models or multimodal systems. The approach established here could become a foundation for developing interpretable, trustworthy AI systems.
- LLMs employ two partially overlapping but mechanistically distinct pathways for expressing values: intrinsic learned values and prompted instruction-based values.
- Intrinsic value mechanisms promote diverse responses across varied scenarios while prompted mechanisms enforce stronger instruction compliance.
- Prompted value mechanisms demonstrate particular effectiveness in adversarial contexts like jailbreaking, representing a potential safety vulnerability.
- Cross-language generalization of value mechanisms suggests these are fundamental architectural features rather than surface-level patterns.
- The discovery enables more targeted AI alignment strategies that can leverage different mechanisms for different safety objectives.