#model-internals News & Analysis

7 articles tagged with #model-internals. AI-curated summaries with sentiment analysis and key takeaways from 50+ sources.

AI · Bullish · arXiv – CS AI · May 27 · 7/10

Guiding LLM Post-training Data Engineering with Model Internals from Sparse Autoencoders

Researchers introduce SAERL, a data engineering framework that uses Sparse Autoencoders to extract intrinsic signals from LLM internals for improved reinforcement learning post-training. The method achieves 3% accuracy gains and 20% faster convergence on math reasoning tasks by modeling data diversity, difficulty, and quality, suggesting that model internals provide practical signals beyond external training data metrics.

AI · Bearish · Apple Machine Learning · Apr 20 · 7/10

What Do Your Logits Know? (The Answer May Surprise You!)

Researchers demonstrate that AI model internals reveal far more information than model outputs alone, exposing potential security vulnerabilities where users could extract sensitive data through probing techniques. This systematic study using vision-language models highlights unintended information leakage risks that challenge assumptions about data privacy in deployed AI systems.

AI · Neutral · arXiv – CS AI · May 29 · 6/10

Internal Representation, Not Clinical Knowledge: Where Apparent LLM Triage Failures Originate

Researchers discovered that large language model failures in clinical triage stem from output formatting constraints rather than deficient medical knowledge. Using sparse autoencoders to analyze model internals, they found medical features activate identically across free-text and multiple-choice formats, but scaffold features drive incorrect decisions at the decision token, suggesting the models possess clinical understanding but struggle with constrained response structures.

AI · Neutral · arXiv – CS AI · May 11 · 6/10

When Does a Language Model Commit? A Finite-Answer Theory of Pre-Verbalization Commitment

Researchers developed a method to measure when language models stabilize their answer preferences during generation, before explicitly verbalizing a final answer. Using finite-answer projection analysis on the Qwen3-4B-Instruct model, they found answer preferences stabilize 17-31 tokens before the model states its answer, revealing the internal commitment dynamics of LLM reasoning.

AI · Neutral · arXiv – CS AI · May 9 · 6/10

Causal Probing for Internal Visual Representations in Multimodal Large Language Models

Researchers developed a causal probing framework to decode how Multimodal Large Language Models internally represent visual concepts, revealing that entities are encoded in localized regions while abstract concepts are distributed globally across the network. The findings expose mechanistic drivers of scaling laws and uncover a disconnect between visual perception and reasoning capabilities in MLLMs.

AI · Neutral · arXiv – CS AI · Apr 14 · 6/10

Relational Preference Encoding in Looped Transformer Internal States

Researchers demonstrate that looped transformers like Ouro-2.6B encode human preferences relationally rather than independently, with pairwise evaluators achieving 95.2% accuracy compared to 21.75% for independent classification. The study concludes that this encoding functions as an internal consistency probe rather than a direct predictor of human annotations.
