🧠 AI | ⚪ Neutral | Importance 7/10

MENTIS: What Belief Changes Under Alignment? Measuring Multi-Scale Latent Torsion in Language Models

arXiv – CS AI | Partha Pratim Saha, Samarth Raina, Mayur Parvatikar, Amit Dhanda, Vinija Jain, Aman Chadha, Amitava Das | June 2, 2026 at 04:00 AM

🤖AI Summary

Researchers introduce MENTIS, a framework for measuring internal geometric changes in language models during preference alignment training. The study reveals that alignment leaves selective, depth-localized signatures in model computations, with normative concepts showing larger internal reorganization than factual concepts across multiple model architectures.

Analysis

MENTIS addresses a gap in AI safety research: understanding what happens inside a language model during preference alignment training. Post-training alignment visibly improves model behavior, but the internal computational mechanisms remain largely opaque. That opacity matters because aligned models still succumb to jailbreaks and prompt injection attacks, which suggests that surface-level behavioral improvements can mask underlying vulnerabilities.

The framework takes a geometry-first approach, using covariance-based torsion measurements to quantify internal reorganization. It finds that alignment does not reshape a model's internals uniformly; instead it produces selective, structured changes concentrated in mid-to-late layers. Normative concepts shift more than factual ones, which suggests that alignment training affects value-laden and knowledge-based reasoning pathways differently. Across four 7-8B model pairs, consistent patterns emerge: torsion correlates negatively with contextual entropy and concentrates in architecture-specific regions, indicating that alignment leaves interpretable geometric signatures.

The work gives interpretability researchers quantitative tools for examining post-training effects beyond behavioral evaluation. For developers and safety researchers, these insights could enable better vulnerability detection and more targeted alignment techniques. Because alignment appears selective rather than uniform, it may be possible to identify which internal structures remain unaligned, which could help explain alignment failures. Future work might use these geometric signatures to predict failure modes or to develop more robust alignment methods targeting specific conceptual domains.
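To make the idea of a layer-wise, covariance-based comparison concrete, here is a minimal sketch. It is not the MENTIS torsion measure, whose definition this summary does not give. It uses a simple stand-in: the relative Frobenius distance between the covariance matrices of hidden activations from a base model and its aligned counterpart, computed at each layer. The function names, array shapes, and synthetic activations are all assumptions made for illustration.

```python
import numpy as np


def covariance_shift(base_acts, aligned_acts):
    """Relative Frobenius distance between activation covariances at one layer.

    Stand-in proxy only; NOT the torsion measure defined by MENTIS.
    Both inputs have shape (n_samples, hidden_dim).
    """
    cov_base = np.cov(base_acts, rowvar=False)
    cov_aligned = np.cov(aligned_acts, rowvar=False)
    return np.linalg.norm(cov_aligned - cov_base) / np.linalg.norm(cov_base)


def layer_profile(base_layers, aligned_layers):
    """Apply the proxy layer by layer to get a depth profile."""
    return np.array(
        [covariance_shift(b, a) for b, a in zip(base_layers, aligned_layers)]
    )


# Synthetic stand-ins for hidden states (hypothetical sizes, not real model data).
rng = np.random.default_rng(0)
n_layers, n_samples, dim = 6, 500, 8
base_layers = [rng.normal(size=(n_samples, dim)) for _ in range(n_layers)]

# Hypothetical "aligned" model: early layers unchanged, layers 3-5 linearly distorted.
mix = np.eye(dim) + 0.5 * rng.normal(size=(dim, dim))
aligned_layers = [
    acts @ mix if layer >= 3 else acts.copy()
    for layer, acts in enumerate(base_layers)
]

profile = layer_profile(base_layers, aligned_layers)
for layer, shift in enumerate(profile):
    print(f"layer {layer}: covariance shift = {shift:.3f}")
```

In this toy setup the profile is zero for the untouched early layers and positive for the distorted later ones, which is the kind of depth-localized pattern the article describes. With real models, the activations would come from running the same prompts through both checkpoints and collecting hidden states per layer.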

Key Takeaways

→MENTIS framework reveals alignment training creates selective, structured changes in model internals rather than uniform reorganization
→Normative concepts show larger internal geometric shifts than factual concepts during preference alignment
→Alignment-induced changes concentrate in mid-to-late layers with architecture-specific patterns across model families
→Negative correlation between torsion and contextual entropy suggests information complexity influences alignment's internal effects
→Geometry-based internal analysis surfaces alignment signatures that behavioral evaluation alone does not show, which could help locate internal structures that remain unaligned
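The reported negative relationship between torsion and contextual entropy is the kind of claim one could check with a rank correlation. The sketch below is an illustration under stated assumptions, not the paper's analysis: the helper `spearman_rho` is a hypothetical name, the data are synthetic, and the choice of Spearman correlation is ours, since the summary does not say which statistic the authors used.

```python
import numpy as np


def spearman_rho(x, y):
    """Spearman rank correlation for 1-D samples without ties."""
    # argsort of argsort turns values into 0-based ranks when there are no ties.
    rank_x = np.argsort(np.argsort(x))
    rank_y = np.argsort(np.argsort(y))
    return float(np.corrcoef(rank_x, rank_y)[0, 1])


# Synthetic per-concept scores (assumed for illustration, not results from the paper).
rng = np.random.default_rng(1)
entropy = rng.uniform(0.5, 4.0, size=50)
# Construct torsion to fall as entropy rises, plus a little noise.
torsion = 1.0 - 0.2 * entropy + rng.normal(scale=0.05, size=50)

rho = spearman_rho(entropy, torsion)
print(f"Spearman rho = {rho:.2f}")
```

A rank correlation is used here because it only assumes a monotone relationship, not a linear one. If the real scores contain ties, a tie-aware implementation such as `scipy.stats.spearmanr` would be the safer choice.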

#language-models #alignment #interpretability #mechanistic-analysis #ai-safety #internal-representations #post-training #geometric-analysis

Read Original →via arXiv – CS AI
