#activation-analysis News & Analysis

5 articles tagged with #activation-analysis. AI-curated summaries with sentiment analysis and key takeaways from 50+ sources.

AI · Neutral · arXiv – CS AI · 1d ago · 7/10

Value Entanglement: Conflation Between Different Kinds of Good In (Some) Large Language Models

Researchers have identified a critical flaw in large language models where moral values inappropriately influence judgments about grammatical and economic quality. The study reveals that LLMs conflate different types of value rather than distinguishing them as humans do, a problem that can be partially fixed through targeted ablation of morality-related activation vectors.

AI · Bullish · arXiv – CS AI · May 12 · 7/10

Do LLMs Experience an Internal Polylogue? Investigating Reasoning through the Lens of Personas

Researchers demonstrate that large language models encode behavioral traits as linear directions in activation space called "persona vectors," which can be monitored and manipulated during reasoning. By treating these vectors as dynamic signals over generation time (termed a "polylogue"), they achieve competitive performance at predicting accuracy on MMLU-Pro while enabling stage-aware latent steering that improves model performance.

AI · Neutral · arXiv – CS AI · May 1 · 7/10

Latent Adversarial Detection: Adaptive Probing of LLM Activations for Multi-Turn Attack Detection

Researchers demonstrate that multi-turn prompt injection attacks leave detectable signatures in language model activation patterns, achieving 93.8% detection accuracy through analysis of residual stream trajectories. The approach reveals that adversarial attack sequences exhibit distinctive 'restlessness' patterns across model architectures, though detection effectiveness varies significantly when deployed on real-world data.

AI · Bullish · arXiv – CS AI · Apr 10 · 7/10

Distributed Interpretability and Control for Large Language Models

Researchers have developed a scalable system for interpreting and controlling large language models distributed across multiple GPUs, achieving up to 7x memory reduction and 41x throughput improvements. The method enables real-time behavioral steering of frontier LLMs like LLaMA and Qwen without fine-tuning, with results released as open-source tooling.

AI · Neutral · arXiv – CS AI · Mar 4 · 6/10 · 3

Narrow Finetuning Leaves Clearly Readable Traces in Activation Differences

Researchers found that narrow finetuning of large language models leaves detectable traces in model activations that can reveal information about the training domain. The study demonstrates that these traces can be used to infer what data was used for finetuning, and suggests mixing pretraining data into finetuning to reduce them.