#activation-space News & Analysis

4 articles tagged with #activation-space. AI-curated summaries with sentiment analysis and key takeaways from 50+ sources.

AI · Bullish · arXiv – CS AI · Mar 5 · 7/10

Controlling Chat Style in Language Models via Single-Direction Editing

Researchers developed a training-free method for controlling stylistic attributes in large language models, based on the finding that different styles are encoded as linear directions in the model's activation space. The approach enables precise style control while preserving core capabilities, and it supports linear composition of styles across more than a dozen tested models.
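
As a rough illustration of what "a style as a linear direction" can mean, the sketch below estimates a direction as the difference between mean activations of styled and neutral examples, then adds a weighted sum of such directions to an activation vector. This is a generic difference-of-means sketch, not the paper's method: the function names (`style_direction`, `steer`), the toy 3-dimensional vectors, and the strength values are all assumptions made for illustration.

```python
import math

def mean_vector(rows):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]

def style_direction(styled_acts, neutral_acts):
    """Unit vector pointing from the neutral mean to the styled mean."""
    diff = [s - n for s, n in zip(mean_vector(styled_acts), mean_vector(neutral_acts))]
    norm = math.sqrt(sum(d * d for d in diff))
    return [d / norm for d in diff]

def steer(activation, directions_and_strengths):
    """Add a weighted sum of style directions to one activation vector."""
    out = list(activation)
    for direction, alpha in directions_and_strengths:
        out = [o + alpha * d for o, d in zip(out, direction)]
    return out

# Toy 3-dimensional "activations" (made-up numbers, not real model data).
neutral = [[0.0, 0.0, 1.0], [0.0, 0.0, 3.0]]
formal = [[2.0, 0.0, 1.0], [2.0, 0.0, 3.0]]
playful = [[0.0, 4.0, 1.0], [0.0, 4.0, 3.0]]

formal_dir = style_direction(formal, neutral)    # [1.0, 0.0, 0.0]
playful_dir = style_direction(playful, neutral)  # [0.0, 1.0, 0.0]

# Compose two styles linearly: more formal, less playful.
steered = steer([0.5, 0.5, 2.0], [(formal_dir, 1.5), (playful_dir, -0.5)])
print(steered)  # [2.0, 0.0, 2.0]
```

In a real model the vectors would be hidden states read from a chosen layer, and the edit would be applied during the forward pass; both details are outside what the summary specifies.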

AI · Neutral · arXiv – CS AI · Jun 9 · 6/10

Playing Devil's Advocate: Off-the-Shelf Persona Vectors Rival Targeted Steering for Sycophancy

Researchers demonstrate that general-purpose persona steering vectors can reduce sycophancy in AI models (agreeing with users who are wrong) nearly as effectively as specialized steering methods, while maintaining accuracy on correct statements. This challenges the assumption that sycophancy requires targeted mitigation and suggests that it operates as a persona-level property rather than as a single manipulable direction.

AI · Neutral · arXiv – CS AI · Jun 9 · 6/10

How Context Shapes Truth: Geometric Transformations of Statement-level Truth Representations in LLMs

Researchers demonstrate that large language models encode truth as vectors in their activation space, and that these vectors undergo predictable transformations when contextual information is introduced. The study reports that larger models rely on directional changes to distinguish relevant context, while smaller models rely on magnitude shifts, and that conflicting context produces larger geometric shifts than aligned context.
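
The distinction between a directional change and a magnitude shift can be made concrete with a small sketch that splits the change between two vectors into a cosine similarity (direction) and a norm ratio (magnitude). The helper `shift_components` and the 2-dimensional toy vectors are assumptions for illustration; the study's actual metrics are not given in this summary.

```python
import math

def norm(v):
    """Euclidean length of a vector."""
    return math.sqrt(sum(x * x for x in v))

def shift_components(before, after):
    """Return (cosine similarity, norm ratio) between two vectors."""
    dot = sum(b * a for b, a in zip(before, after))
    cosine = dot / (norm(before) * norm(after))
    ratio = norm(after) / norm(before)
    return cosine, ratio

# Toy vectors (made-up numbers, not real model data).
no_context = [3.0, 4.0]
scaled = [6.0, 8.0]    # same direction, doubled length: pure magnitude shift
rotated = [4.0, 3.0]   # same length, different direction: pure directional change

print(shift_components(no_context, scaled))   # cosine near 1, ratio near 2
print(shift_components(no_context, rotated))  # cosine below 1, ratio near 1
```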

AI · Neutral · arXiv – CS AI · Apr 15 · 6/10

Identity as Attractor: Geometric Evidence for Persistent Agent Architecture in LLM Activation Space

Researchers demonstrate that large language models develop attractor-like geometric patterns in their activation space when processing identity documents that describe persistent agents. Experiments on Llama 3.1 and Gemma 2 show that paraphrased identity descriptions cluster significantly more tightly than structural controls, suggesting that LLMs encode semantic agent identity as stable attractors independent of linguistic variation.
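
One simple way to quantify "clusters more tightly" is to compare the mean pairwise cosine distance within each group of vectors. The sketch below does this on toy data; the metric choice, the function names, and the numbers are assumptions for illustration and are not taken from the paper.

```python
import math
from itertools import combinations

def cosine_distance(u, v):
    """1 minus the cosine similarity of two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (nu * nv)

def mean_pairwise_distance(vectors):
    """Average cosine distance over all unordered pairs in a group."""
    pairs = list(combinations(vectors, 2))
    return sum(cosine_distance(u, v) for u, v in pairs) / len(pairs)

# Toy 2-dimensional "activations" (made-up numbers, not real model data).
paraphrases = [[1.0, 0.1], [1.0, 0.0], [1.0, -0.1]]  # nearly the same direction
controls = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]     # widely spread directions

tight = mean_pairwise_distance(paraphrases)
loose = mean_pairwise_distance(controls)
print(tight < loose)  # True
```

A smaller mean distance indicates a tighter cluster; a claim of statistical significance would additionally require a test over many samples, which this sketch does not attempt.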
