#model-steering News & Analysis

25 articles tagged with #model-steering. AI-curated summaries with sentiment analysis and key takeaways from 50+ sources.

AI · Bearish · arXiv – CS AI · Jun 25 · 7/10

Perfect Detection, Failed Control: The Geometry of Knowing vs. Steering in Language Models

Researchers discovered that language models can detect undesirable behaviors like hallucination with near-perfect accuracy, yet the neural directions enabling detection are nearly orthogonal (83 degrees apart) from those controlling the behavior. This fundamental geometric dissociation between knowing and steering persists across multiple models and scales, challenging a core assumption of mechanistic interpretability that detection should enable control.
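
To make the 83-degree figure concrete: the angle between two directions follows from their cosine similarity, and 83 degrees corresponds to a cosine of only about 0.12, close to the 0 of fully orthogonal vectors. The sketch below is a minimal illustration using hypothetical 2-D toy vectors (`detect` and `steer` are made-up names and values, not directions from the paper).

```python
import math

def angle_degrees(u, v):
    """Angle between two vectors in degrees, computed from cosine similarity."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    cos = dot / (norm_u * norm_v)
    # Clamp to guard against tiny floating-point overshoot outside [-1, 1].
    cos = max(-1.0, min(1.0, cos))
    return math.degrees(math.acos(cos))

# Cosine similarity implied by an 83-degree separation (about 0.12).
cos_83 = math.cos(math.radians(83))

# Hypothetical toy directions constructed to sit 83 degrees apart.
detect = [1.0, 0.0]
steer = [cos_83, math.sin(math.radians(83))]

print(round(cos_83, 2))
print(round(angle_degrees(detect, steer), 1))
```

In real models the same calculation would run on high-dimensional activation directions; the 2-D vectors here only show the arithmetic.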

AI · Bullish · arXiv – CS AI · Jun 11 · 7/10

Toward Preference-aligned Large Language Models via Residual-based Model Steering

Researchers introduce PaLRS, a training-free method for aligning large language models with human preferences using lightweight steering vectors extracted from residual streams. The approach requires minimal data (100+ preference pairs) and achieves better performance than standard optimization methods like DPO with significantly lower computational costs.
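
For readers new to steering vectors, one common generic recipe is to take the difference between the mean activation on preferred examples and the mean activation on dispreferred ones, then add a scaled copy of that difference to a hidden state at inference time. The sketch below illustrates that generic difference-of-means idea on toy lists; it is an assumption for illustration and not necessarily how PaLRS extracts or applies its vectors. All function names and numbers are hypothetical.

```python
def mean_vector(rows):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]

def steering_vector(preferred, dispreferred):
    """Difference of mean activations: preferred minus dispreferred."""
    mean_pref = mean_vector(preferred)
    mean_disp = mean_vector(dispreferred)
    return [p - d for p, d in zip(mean_pref, mean_disp)]

def apply_steering(hidden, vector, strength=1.0):
    """Shift a hidden state along the steering vector by a chosen strength."""
    return [h + strength * v for h, v in zip(hidden, vector)]

# Hypothetical 2-D "activations" standing in for residual-stream states.
preferred = [[1.0, 2.0], [3.0, 4.0]]      # mean is [2.0, 3.0]
dispreferred = [[0.0, 1.0], [2.0, 1.0]]   # mean is [1.0, 1.0]

vec = steering_vector(preferred, dispreferred)
steered = apply_steering([0.0, 0.0], vec, strength=0.5)
print(vec, steered)
```

The `strength` parameter is the usual knob in this kind of approach: larger values push outputs harder toward the preferred behavior at the risk of degrading fluency.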

AI · Bullish · arXiv – CS AI · Jun 8 · 7/10

Inside the Visual Mind: Neuroscience-Motivated Concept Circuits for Interpreting and Steering Vision Transformers

Researchers introduce ViSAE, a mechanistic interpretability toolbox that uses neuroscience-inspired principles to decode how Vision Transformers make decisions through human-interpretable concept circuits. The method achieves significant improvements in model auditing and steering, with concept editing improving worst-group accuracy by 48.2% on benchmark tests, addressing critical safety concerns before ViT deployment.

AI · Bullish · arXiv – CS AI · Jun 5 · 7/10

Do Models Share Safety Representations? Cross-Model Steering for Safe Visual Generation

Researchers demonstrate that safety behaviors in generative AI models can be represented as portable latent directions that transfer across different architectures without requiring unsafe training data on target models. This framework enables cross-model safety steering for text-to-image and text-to-video generation, suggesting safety is a shared property rather than model-specific.

AI · Neutral · arXiv – CS AI · May 29 · 7/10

Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet

Researchers successfully trained sparse autoencoders with 34 million features on Claude 3 Sonnet, demonstrating that dictionary learning methods can scale to production-grade language models. The extracted features show interpretability across languages and modalities, identify harmful behavioral patterns like deception and bias, and enable direct steering of model outputs—though significant limitations remain in feature completeness and validation rigor.

🧠 Claude

AI · Bullish · arXiv – CS AI · May 29 · 7/10

Modeling Hierarchical Thinking in Large Reasoning Models

Researchers propose modeling Large Reasoning Models' Chain-of-Thought processes as trajectories through a six-state Finite State Machine, enabling better understanding and control of reasoning dynamics. They introduce Q-Value guided steering, a training-free method that optimizes reasoning by applying sparse activation steering at sentence boundaries, achieving significant performance gains across multiple benchmarks with minimal computational overhead.

AI · Bullish · arXiv – CS AI · May 12 · 7/10

Towards Effective Theory of LLMs: A Representation Learning Approach

Researchers introduce Representational Effective Theory (RET), a framework that interprets large language model computation through learned high-level variables rather than individual neuron activations. The approach successfully identifies meaningful mental-state trajectories, enables early prediction of behavioral patterns like sycophancy, and provides causal mechanisms for steering model outputs, suggesting LLMs can be understood and controlled through effective macroscopic descriptions.

AI · Bullish · arXiv – CS AI · May 9 · 7/10

TACT: Mitigating Overthinking and Overacting in Coding Agents via Activation Steering

Researchers introduce TACT, a technique using activation steering to detect and correct 'agent drift' in language model coding agents, where models either repeatedly reason over known information or issue tool calls without proper reasoning. The method improves task resolution rates by 4.8-5.8 percentage points across multiple benchmarks while reducing steps needed to complete tasks by up to 26%.

AI · Bullish · arXiv – CS AI · Apr 10 · 7/10

Distributed Interpretability and Control for Large Language Models

Researchers have developed a scalable system for interpreting and controlling large language models distributed across multiple GPUs, achieving up to 7x memory reduction and 41x throughput improvements. The method enables real-time behavioral steering of frontier LLMs like LLaMA and Qwen without fine-tuning, with results released as open-source tooling.

AI · Bullish · arXiv – CS AI · Apr 7 · 7/10

Hallucination Basins: A Dynamic Framework for Understanding and Controlling LLM Hallucinations

Researchers introduce a geometric framework for understanding LLM hallucinations, showing they arise from basin structures in latent space that vary by task complexity. The study demonstrates that factual tasks have clearer separation while summarization tasks show unstable, overlapping patterns, and proposes geometry-aware steering to reduce hallucinations without retraining.

AI · Neutral · arXiv – CS AI · Mar 27 · 7/10

Sparse Visual Thought Circuits in Vision-Language Models

Research reveals that sparse autoencoder (SAE) features in vision-language models often fail to compose modularly for reasoning tasks. The study finds that combining task-selective feature sets frequently causes output drift and accuracy degradation, challenging assumptions used in AI model steering methods.

AI · Neutral · arXiv – CS AI · Mar 27 · 7/10

Closing the Confidence-Faithfulness Gap in Large Language Models

Researchers have identified a fundamental issue in large language models where verbalized confidence scores don't align with actual accuracy due to orthogonal encoding of these signals. They discovered a 'Reasoning Contamination Effect' where simultaneous reasoning disrupts confidence calibration, and developed a two-stage adaptive steering pipeline to improve alignment.

AI · Bullish · arXiv – CS AI · Mar 3 · 7/10 · 2

Sparse Shift Autoencoders for Identifying Concepts from Large Language Model Activations

Researchers introduce Sparse Shift Autoencoders (SSAEs), a new method for improving large language model interpretability by learning sparse representations of differences between embeddings rather than the embeddings themselves. This approach addresses the identifiability problem in current sparse autoencoder techniques, potentially enabling more precise control over specific AI behaviors without unintended side effects.

AI · Neutral · arXiv – CS AI · Jun 25 · 6/10

Steering Vision-Language Models with Joint Sparse Autoencoders

Researchers introduce Joint Sparse Autoencoders (JSAE), a technique that improves how vision-language models can be analyzed and controlled by aligning visual and textual representations into shared, interpretable features. Testing across multiple VLM architectures reveals that steering interventions work most effectively at mid-to-late layers, offering insights for more precise multimodal model control.

🧠 Llama

AI · Neutral · arXiv – CS AI · Jun 10 · 6/10

Interpreting and Steering a Text-to-Speech Language Model with Sparse Autoencoders

Researchers have developed sparse autoencoders to interpret and control how language models process text-to-speech synthesis in CosyVoice3. The work demonstrates that interpretable features—phonemes, laughter, accent, and speaker gender—are causally linked to speech output and can be precisely steered to modify synthesis behavior without retraining.

AI · Bullish · arXiv – CS AI · Jun 10 · 6/10

Flow Control: Steering Vision-Language-Action Models with Simple Real-Time Inputs

Researchers introduce flow control, a technique that enables real-time steering of vision-language-action (VLA) models through simple user inputs like keyboards without requiring model retraining. The method allows users to guide robot actions toward their intent while maintaining high-quality outputs aligned with the model's learned expert distribution, improving task success rates and completion times.

AI · Neutral · arXiv – CS AI · Jun 9 · 6/10

Pre-Intervention Prediction of Sparse Autoencoder Steering Side Effects

Researchers have developed a pre-intervention screening framework that predicts unintended side effects of sparse autoencoder (SAE) steering in language models before they occur. By analyzing feature statistics, the framework identifies which steering interventions will behave consistently and avoid disrupting unrelated features, with varying success across different model architectures.

🧠 Llama

AI · Neutral · arXiv – CS AI · Jun 8 · 6/10

VALUEFLOW: Toward Pluralistic and Steerable Value-based Alignment in Large Language Models

Researchers introduce VALUEFLOW, a comprehensive framework for aligning Large Language Models with diverse human values through hierarchical extraction, calibrated intensity evaluation, and steerable control mechanisms. The system addresses fundamental limitations in existing preference-based alignment approaches by enabling precise, multi-theory value alignment at controlled intensities across different models.

AI · Neutral · arXiv – CS AI · May 28 · 6/10

Cultural Binding Heads in Language Models

Researchers identify specific attention heads in large language models responsible for cultural binding—associating cultural items with appropriate identities. Through mechanistic interpretability analysis, they find that steering these heads can improve cultural differentiation accuracy by 1-3 percentage points, revealing that models possess far more cultural knowledge than they actively use.

AI · Neutral · arXiv – CS AI · May 28 · 6/10

Multi-Adapter Representation Interventions via Energy Calibration

Researchers propose MARI, a novel method for aligning large language models through adaptive representation interventions that adjust correction strength per input rather than applying uniform fixes. The approach combines multi-adapter experts with energy-based gating to maintain general model capabilities while improving alignment on safety and truthfulness benchmarks.

AI · Bullish · arXiv – CS AI · May 12 · 6/10

Parameter-Efficient Neuroevolution for Diverse LLM Generation: Quality-Diversity Optimization via Prompt Embedding Evolution

Researchers introduce QD-LLM, a framework that evolves lightweight prompt embeddings (~32K parameters) to steer frozen large language models toward diverse outputs without fine-tuning. The approach outperforms existing quality-diversity optimization methods by 46.4% in coverage and demonstrates practical applications in test generation and training data improvement.

🧠 Llama

AI · Bullish · arXiv – CS AI · Apr 6 · 6/10

Valence-Arousal Subspace in LLMs: Circular Emotion Geometry and Multi-Behavioral Control

Researchers developed a method to identify valence-arousal subspaces in large language models, enabling controlled emotional steering of AI outputs. The technique demonstrates cross-architecture effectiveness on multiple models and reveals that emotional control can bidirectionally influence AI behaviors like refusal and sycophancy.

🧠 Llama

AI · Neutral · arXiv – CS AI · Mar 17 · 6/10

Gradient Atoms: Unsupervised Discovery, Attribution and Steering of Model Behaviors via Sparse Decomposition of Training Gradients

Researchers introduce Gradient Atoms, an unsupervised method that decomposes AI model training gradients to discover interpretable behaviors without requiring predefined queries. The technique can identify model behaviors like refusal patterns and arithmetic capabilities, while also serving as effective steering vectors to control model outputs.

AI · Bullish · arXiv – CS AI · Mar 17 · 6/10

Nudging Hidden States: Training-Free Model Steering for Chain-of-Thought Reasoning in Large Audio-Language Models

Researchers developed training-free model steering techniques to improve reasoning in large audio-language models (LALMs) through chain-of-thought prompting. The approach achieved up to 4.4% accuracy gains and demonstrated cross-modal transfer where text-derived steering vectors can effectively guide speech-based reasoning.

AI · Bullish · arXiv – CS AI · Mar 3 · 6/10 · 4

EasySteer: A Unified Framework for High-Performance and Extensible LLM Steering

Researchers have developed EasySteer, a unified framework for controlling large language model behavior at inference time that achieves a 10.8-22.3x speedup over existing frameworks. The system offers a modular architecture with pre-computed steering vectors for eight application domains and aims to turn steering from a research technique into a production-ready capability.