MENTIS: What Belief Changes Under Alignment? Measuring Multi-Scale Latent Torsion in Language Models
Researchers introduce MENTIS, a framework for measuring internal geometric changes in language models during preference alignment training. The study reveals that alignment leaves selective, depth-localized signatures in model computations, with normative concepts showing larger internal reorganization than factual concepts across multiple model architectures.
MENTIS addresses a gap in AI safety research: understanding what happens inside language models when they undergo preference alignment training. Post-training alignment visibly improves model behavior, but the internal computational mechanisms remain largely opaque. That opacity matters because aligned models still succumb to jailbreaks and prompt injection attacks, which suggests that surface-level behavioral improvements may mask underlying vulnerabilities.
The framework's geometry-first approach uses covariance-based torsion measurements to quantify internal reorganization. It finds that alignment does not uniformly reshape a model's internals but instead creates selective, structured changes concentrated in mid-to-late layers. The gap between normative and factual concept shifts suggests that alignment training affects value-laden and knowledge-based reasoning pathways differently. Across four 7-8B model pairs, consistent patterns emerge: torsion correlates negatively with contextual entropy and concentrates in architecture-specific regions, indicating that alignment leaves interpretable geometric signatures.
The work advances interpretability by providing quantitative tools for examining post-training effects beyond behavioral evaluation. For developers and safety researchers, these insights could enable better vulnerability detection and more targeted alignment techniques. The finding that alignment is selective rather than uniform opens the possibility of identifying which internal structures remain unaligned, which might help explain alignment failures. Future work might use these geometric signatures to predict failure modes or to develop more robust alignment methods targeting specific conceptual domains.
- MENTIS framework reveals alignment training creates selective, structured changes in model internals rather than uniform reorganization
- Normative concepts show larger internal geometric shifts than factual concepts during preference alignment
- Alignment-induced changes concentrate in mid-to-late layers with architecture-specific patterns across model families
- Negative correlation between torsion and contextual entropy suggests information complexity influences alignment's internal effects
- Geometry-based internal analysis exposes alignment signatures that behavioral evaluation alone does not capture
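To make the idea of a covariance-based, layer-by-layer comparison concrete, here is a minimal sketch in plain Python. It is not the MENTIS torsion measure, whose definition is not given in this summary. It swaps in a simpler stand-in: the relative Frobenius distance between the activation covariance matrices of a base model and its aligned counterpart, computed per layer. The function names (`covariance_shift`, `layer_profile`, `make_synthetic_pair`) and the synthetic activations are hypothetical and exist only for illustration.

```python
import math
import random


def covariance(acts):
    """Sample covariance matrix of activations shaped (n_samples, dim)."""
    n, dim = len(acts), len(acts[0])
    means = [sum(row[j] for row in acts) / n for j in range(dim)]
    centered = [[row[j] - means[j] for j in range(dim)] for row in acts]
    return [
        [sum(r[i] * r[j] for r in centered) / (n - 1) for j in range(dim)]
        for i in range(dim)
    ]


def frobenius(matrix):
    """Frobenius norm of a matrix given as a list of rows."""
    return math.sqrt(sum(v * v for row in matrix for v in row))


def covariance_shift(base_acts, aligned_acts):
    """Stand-in proxy (NOT the MENTIS torsion): relative Frobenius distance
    between the two covariance matrices. Lies in [0, 1]."""
    cov_base = covariance(base_acts)
    cov_aligned = covariance(aligned_acts)
    diff = [[a - b for a, b in zip(ra, rb)] for ra, rb in zip(cov_aligned, cov_base)]
    denom = frobenius(cov_base) + frobenius(cov_aligned)
    return frobenius(diff) / denom if denom > 0 else 0.0


def layer_profile(base_layers, aligned_layers):
    """One shift score per layer, comparing base and aligned activations."""
    return [covariance_shift(b, a) for b, a in zip(base_layers, aligned_layers)]


def make_synthetic_pair(rng, n_layers=6, n_samples=50, dim=4, changed=(4,), scale=2.0):
    """Hypothetical data: the 'aligned' model equals the base model except
    that half of the dimensions are rescaled in the `changed` layers."""
    base = [
        [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n_samples)]
        for _ in range(n_layers)
    ]
    aligned = []
    for i, layer in enumerate(base):
        if i in changed:
            half = dim // 2
            layer = [[v * scale for v in row[:half]] + row[half:] for row in layer]
        aligned.append(layer)
    return base, aligned


rng = random.Random(0)
base, aligned = make_synthetic_pair(rng)
profile = layer_profile(base, aligned)
peak_layer = max(range(len(profile)), key=profile.__getitem__)
```

In this toy setup the profile is zero for every untouched layer and positive only where the synthetic "alignment" changed the activations, which mirrors the depth-localized pattern described above. With real models, the per-layer inputs would be hidden states collected from the base and aligned checkpoints on the same prompts, and running the profile separately on normative and factual prompt sets would give a rough analogue of the concept-level comparison.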