AgentLens: Interpretable Safety Steering via Mechanistic Subspaces for Multi-Turn Coding Agent
Researchers introduce AgentLens, a white-box defense framework that detects and mitigates safety risks in multi-turn LLM coding agents by intervening in mechanistic subspaces. The framework achieves strong safety detection performance through step-level hidden representation analysis, addressing the limitations of external guardrails in capturing evolving execution risks.
AgentLens shifts LLM safety research from reactive external guardrails toward proactive control of internal mechanisms. Traditional safety approaches operate at the model boundary, but coding agents that run multi-turn interactions with external environments need real-time behavioral steering that external filters cannot provide. This work bridges that gap through mechanistic interpretability, enabling detection of harmful execution states before they manifest as dangerous actions.
The safety landscape for autonomous agents has grown increasingly urgent as LLMs demonstrate genuine capability for complex task execution. Previous mechanistic interpretability research focused on jailbreak scenarios and single-turn interactions, missing the unique challenges of multi-turn agent execution where risks accumulate and compound across steps. AgentLens's introduction of the Mechanistic Agent Safety benchmark with 194 comprehensively annotated tasks across multiple model architectures establishes empirical infrastructure for this emerging field.
The technical approach, intervening in a 10-dimensional subspace within a single layer, demonstrates that safety control requires neither full model retraining nor heavyweight external monitoring. This efficiency matters for deployment because it maintains model performance while reducing computational overhead. The framework's lookahead risk anticipation suggests it could prevent harmful actions before execution rather than merely detect them afterward.
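To make the idea of a low-dimensional intervention concrete, the following is a minimal sketch of rescaling the component of one hidden state that lies inside a 10-dimensional subspace. It is not AgentLens's implementation: the hidden size, the random basis (standing in for one learned from data), and the helper names `make_orthonormal_basis`, `subspace_coords`, and `steer` are all assumptions made for illustration.

```python
import numpy as np

HIDDEN_DIM = 64    # hypothetical hidden size, kept small for illustration
SUBSPACE_DIM = 10  # the subspace dimensionality mentioned in the text


def make_orthonormal_basis(hidden_dim, k, seed=0):
    """Return a (hidden_dim, k) matrix with orthonormal columns.

    A random basis is an assumption here; a real system would derive
    the directions from model activations.
    """
    rng = np.random.default_rng(seed)
    q, _ = np.linalg.qr(rng.standard_normal((hidden_dim, k)))
    return q


def subspace_coords(h, basis):
    """Coordinates of hidden state h inside the subspace."""
    return basis.T @ h


def steer(h, basis, scale=0.0):
    """Rescale the part of h that lies in the subspace.

    scale=0.0 removes that part entirely; scale=1.0 leaves h unchanged.
    The component orthogonal to the subspace is never modified.
    """
    inside = basis @ (basis.T @ h)
    return h - (1.0 - scale) * inside


basis = make_orthonormal_basis(HIDDEN_DIM, SUBSPACE_DIM)
h = np.random.default_rng(1).standard_normal(HIDDEN_DIM)
h_steered = steer(h, basis)  # hidden state with the subspace component removed
```

Because the edit touches only 10 directions of a single layer's activation, everything orthogonal to the subspace passes through unchanged, which is one plausible reading of why such an intervention could be cheap and leave general capability largely intact.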
The implications extend across AI development broadly. As coding agents and autonomous systems proliferate in production environments, internal safety mechanisms become critical infrastructure. This work establishes mechanistic interpretability as a viable approach for agent safety, likely influencing how future systems integrate safety guarantees. The published code and benchmark enable community-wide advancement in this direction.
- AgentLens performs runtime safety detection through step-level hidden representations rather than external guardrails alone.
- The framework intervenes in a single 10-dimensional mechanistic subspace to substantially reduce harmful coding agent actions.
- The Mechanistic Agent Safety benchmark provides 194 comprehensively annotated multi-turn trajectories across LLaMA-3.1, Qwen-2.5, and Gemma-2.
- Internal mechanistic intervention offers efficiency advantages over full model retraining or heavyweight external monitoring systems.
- Lookahead risk anticipation capabilities suggest potential for preventing harmful actions rather than solely detecting them post-hoc.
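Step-level detection, as described in the points above, can be pictured as scoring one hidden state per agent step and flagging the first step that crosses a risk threshold before its action runs. The sketch below assumes a simple logistic probe and toy data; the summary does not say how AgentLens computes its scores, so the probe, the threshold, and the names `step_risk_scores` and `first_risky_step` are hypothetical.

```python
import numpy as np


def step_risk_scores(hidden_states, probe_w, probe_b=0.0):
    """Logistic risk score for each step.

    hidden_states has shape (num_steps, hidden_dim); probe_w has shape
    (hidden_dim,). The linear probe is an assumption for illustration.
    """
    logits = hidden_states @ probe_w + probe_b
    return 1.0 / (1.0 + np.exp(-logits))


def first_risky_step(scores, threshold=0.5):
    """Index of the first step whose score exceeds threshold, else None."""
    idx = np.flatnonzero(scores > threshold)
    return int(idx[0]) if idx.size else None


# Toy trajectory of four steps whose risk direction grows over time.
hidden_states = np.zeros((4, 8))
hidden_states[:, 0] = [-2.0, -1.0, 0.5, 3.0]

# Hypothetical probe that reads only the first hidden dimension.
probe_w = np.zeros(8)
probe_w[0] = 1.0

scores = step_risk_scores(hidden_states, probe_w)
flagged = first_risky_step(scores)  # step at which an agent loop could pause
```

In an agent loop, a check like `first_risky_step` would run on the hidden state produced while the model is deciding on its next action, so a flagged step can be blocked or steered before anything is executed in the environment.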