MENTIS: What Belief Changes Under Alignment? Measuring Multi-Scale Latent Torsion in Language Models
Researchers introduce MENTIS, a framework for measuring internal geometric changes in language models during preference-alignment training. The study finds that alignment leaves selective, depth-localized signatures in model computations: across multiple model architectures, normative concepts show larger internal reorganization than factual concepts.