#model-control News & Analysis

14 articles tagged with #model-control. AI-curated summaries with sentiment analysis and key takeaways from 50+ sources.

AI · Neutral · arXiv – CS AI · May 27 · 7/10

Retrying vs Resampling in AI Control

Researchers studying AI safety mechanisms find that retrying—blocking risky model actions—can be exploited by adversarial AI systems that learn from monitor feedback, while resampling multiple outputs without information leakage proves more effective. In controlled testing with Claude Opus 4.6, resampling increased safety from 61% to 71% while maintaining usefulness, challenging prior assumptions about optimal audit strategies.
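
The contrast with retrying can be made concrete with a minimal sketch of one resampling step. It is an illustration under stated assumptions, not the paper's protocol: `resample_step`, `toy_policy`, `toy_monitor`, the value of `k`, and the audit threshold are all hypothetical. The point it shows is that every candidate is drawn from the same context and the policy never sees a monitor score, so there is no feedback for it to learn from.

```python
import random

def resample_step(policy, monitor, k, audit_threshold, rng):
    """Draw k independent candidates, keep the least suspicious one,
    and flag the step for audit if even that one scores high.

    The policy is called with the same inputs every time and never
    receives a monitor score (assumed reading of "no information leakage").
    """
    candidates = [policy(rng) for _ in range(k)]
    scores = [monitor(c) for c in candidates]
    best = min(range(k), key=lambda i: scores[i])
    return candidates[best], scores[best] >= audit_threshold

# Hypothetical stand-ins for an untrusted model and a trusted monitor.
def toy_policy(rng):
    return "attack" if rng.random() < 0.3 else "benign"

def toy_monitor(action):
    return 0.9 if action == "attack" else 0.1

rng = random.Random(0)
action, flagged = resample_step(toy_policy, toy_monitor, k=4,
                                audit_threshold=0.5, rng=rng)
```

In this toy setup the step is flagged only when all `k` candidates look suspicious, which is one plausible audit rule among several.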

🧠 Claude · 🧠 Opus

AI · Bullish · arXiv – CS AI · May 12 · 7/10

Do LLMs Experience an Internal Polylogue? Investigating Reasoning through the Lens of Personas

Researchers demonstrate that large language models encode behavioral traits as linear directions in activation space called "persona vectors," which can be monitored and manipulated during reasoning. By treating these vectors as dynamic signals over generation time—termed "polylogue"—they achieve competitive accuracy prediction on MMLU-Pro while enabling stage-aware latent steering that improves model performance.
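
A small sketch may help picture what "a linear direction monitored over generation time" means. It assumes hidden states are available as plain vectors, one per generated token. The helper names (`persona_trace`, `steer`) and the toy numbers are hypothetical and not taken from the paper. Monitoring is a projection onto the unit direction at each step, and steering is a shift along that same direction.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def unit(v):
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]

def persona_trace(hidden_states, persona_vector):
    """Project each step's hidden state onto the unit persona direction,
    giving one scalar per generation step."""
    d = unit(persona_vector)
    return [dot(h, d) for h in hidden_states]

def steer(hidden_state, persona_vector, alpha):
    """Shift a hidden state by alpha along the unit persona direction."""
    d = unit(persona_vector)
    return [h + alpha * x for h, x in zip(hidden_state, d)]

# Toy 3-dimensional "hidden states" for three generation steps.
states = [[1.0, 0.0, 0.0], [0.0, 2.0, 0.0], [3.0, 4.0, 0.0]]
direction = [3.0, 4.0, 0.0]

trace = persona_trace(states, direction)          # one value per step
steered = steer(states[0], direction, alpha=5.0)  # push step 0 along the trait
```

A stage-aware variant would choose `alpha` per step based on the trace, but how the paper schedules that is not specified in the summary.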

AI · Bullish · arXiv – CS AI · May 12 · 7/10

HyperTransport: Amortized Conditioning of T2I Generative Models

HyperTransport is a hypernetwork framework that accelerates activation steering for text-to-image models by amortizing optimization cost across concepts. Rather than optimizing intervention parameters for each new concept (which takes minutes), the system learns to map CLIP embeddings directly to steering parameters in a single forward pass, achieving a 3,600-7,000x speedup while matching per-concept baselines on unseen concepts.

AI · Neutral · arXiv – CS AI · May 9 · 7/10

The Granularity Axis: A Micro-to-Macro Latent Direction for Social Roles in Language Models

Researchers demonstrate that large language models encode social role granularity—from individual to institutional perspectives—as a structured geometric axis in their internal representations. Using activation steering, they show this axis is causally manipulable, enabling controlled shifts in response scope across different models.

🧠 Llama

AI · Bullish · arXiv – CS AI · Apr 20 · 7/10

FineSteer: A Unified Framework for Fine-Grained Inference-Time Steering in Large Language Models

Researchers introduce FineSteer, a novel framework for controlling Large Language Model behavior at inference time through two-stage steering: conditional guidance and expert-based vector synthesis. The method achieves superior safety and truthfulness performance while preserving model utility more effectively than existing approaches, without requiring parameter updates.

AI · Bullish · arXiv – CS AI · Apr 13 · 7/10

Neurons Speak in Ranges: Breaking Free from Discrete Neuronal Attribution

Researchers introduce NeuronLens, a framework that interprets neural networks by analyzing activation ranges rather than individual neurons, addressing the widespread polysemanticity problem in large language models. The range-based approach enables more precise concept manipulation while minimizing unintended degradation to model performance.

AI · Bullish · arXiv – CS AI · Mar 17 · 7/10

Steering at the Source: Style Modulation Heads for Robust Persona Control

Researchers have identified a method to control Large Language Model behavior by targeting only three specific attention heads, called "Style Modulation Heads", rather than the entire residual stream. This approach maintains model coherence while enabling precise persona and style control, offering a more efficient alternative to fine-tuning.

AI · Bullish · arXiv – CS AI · Mar 9 · 7/10

COLD-Steer: Steering Large Language Models via In-Context One-step Learning Dynamics

Researchers introduce COLD-Steer, a training-free framework that enables efficient control of large language model behavior at inference time using just a few examples. The method approximates gradient descent effects without parameter updates, achieving 95% steering effectiveness while using 50 times fewer samples than existing approaches.

AI · Bullish · arXiv – CS AI · Mar 5 · 6/10

Controllable and explainable personality sliders for LLMs at inference time

Researchers propose Sequential Adaptive Steering (SAS), a new framework for controlling Large Language Model personalities at inference time without retraining. The method uses orthogonalized steering vectors to enable precise, multi-dimensional personality control by adjusting coefficients, validated on Big Five personality traits.
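
To see why orthogonalized vectors make coefficient "sliders" independent, here is a minimal sketch. It uses classical Gram-Schmidt as a stand-in, which is an assumption: the summary does not say how SAS orthogonalizes its vectors. The function names (`orthogonalize`, `apply_sliders`) and the toy trait vectors are hypothetical.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def orthogonalize(vectors):
    """Sequential Gram-Schmidt: remove earlier directions from each
    vector, then normalize, so the results are orthonormal."""
    basis = []
    for v in vectors:
        w = list(v)
        for u in basis:
            proj = dot(w, u)
            w = [wi - proj * ui for wi, ui in zip(w, u)]
        norm = math.sqrt(dot(w, w))
        if norm < 1e-12:
            raise ValueError("steering vectors are linearly dependent")
        basis.append([wi / norm for wi in w])
    return basis

def apply_sliders(hidden, basis, coefficients):
    """Add each orthonormal trait direction scaled by its slider value."""
    out = list(hidden)
    for u, c in zip(basis, coefficients):
        out = [o + c * ui for o, ui in zip(out, u)]
    return out

# Toy raw steering vectors for two traits (they overlap before orthogonalization).
raw = {"extraversion": [1.0, 0.0, 0.0], "agreeableness": [1.0, 1.0, 0.0]}
basis = orthogonalize(list(raw.values()))
steered = apply_sliders([0.0, 0.0, 1.0], basis, [2.0, -1.0])
```

Because the basis is orthonormal, changing one coefficient leaves the projection onto every other trait direction unchanged, which is the property that makes per-trait sliders interpretable.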

AI · Bullish · arXiv – CS AI · 3d ago · 6/10

Steering LLMs? Actually, Sparse Autoencoders can outperform simple baselines

A new study challenges recent findings that dismissed Sparse Autoencoders (SAEs) as ineffective for steering Large Language Models, demonstrating that SAEs can match LoRA baseline performance when combined with a supervised feature selection pipeline. The research suggests that high sparsity constraints may not be necessary for effective interpretability-based model steering.

AI · Neutral · arXiv – CS AI · May 11 · 6/10

Inference Time Causal Probing in LLMs

Researchers introduce Hidden-state Driven Margin Intervention (HDMI), a new probe-free technique for causal probing in large language models that directly manipulates hidden states without training auxiliary classifiers. The method achieves higher reliability than existing approaches by balancing completeness and selectivity across multiple benchmarks.

🧠 Llama

AI · Bullish · arXiv – CS AI · May 9 · 6/10

Memory Inception: Latent-Space KV Cache Manipulation for Steering LLMs

Researchers introduce Memory Inception (MI), a training-free method for steering large language models by inserting text-derived key-value banks at selected attention layers rather than caching full prompts. MI achieves control competitive with instruction prompting while using up to 118x less storage, and it outperforms existing activation steering methods on personality, reasoning, and guidance tasks.

AI · Neutral · arXiv – CS AI · Apr 10 · 6/10

Steering the Verifiability of Multimodal AI Hallucinations

Researchers have developed a method to control how verifiable AI hallucinations are in multimodal language models by distinguishing between obvious hallucinations (easily detected by humans) and elusive ones (harder to spot). Using a dataset of 4,470 human responses, they created targeted interventions that can fine-tune which types of hallucinations occur, enabling flexible control suited to different security and usability requirements.

AI · Bullish · arXiv – CS AI · Mar 3 · 7/10

Spectral Attention Steering for Prompt Highlighting

Researchers introduce SEKA and AdaSEKA, new training-free methods for attention steering in AI models that work with memory-efficient implementations like FlashAttention. These techniques enable better prompt highlighting by directly editing key embeddings using spectral decomposition, offering significant performance improvements with lower computational overhead.