🧠 AI · 🟢 Bullish · Importance 7/10

Tool Calling is Linearly Readable and Steerable in Language Models

arXiv – CS AI | Zekun Wu (University College London), Ze Wang (University College London), Seonglae Cho (Holistic AI), Yufei Yang (Imperial College London), Adriano Koshiyama (University College London), Sahan Bulathwela (University College London), Maria Perez-Ortiz (University College London)
🤖 AI Summary

Researchers discovered that language models encode tool-selection decisions as interpretable linear patterns in their internal activations, enabling both prediction of errors before execution and steering of tool choices with 77-100% accuracy. This finding has implications for making AI agents more reliable and controllable, particularly in high-stakes settings where a wrong tool selection causes irreversible failures.

Analysis

This research addresses a critical vulnerability in AI agent systems: silent failures where models select incorrect tools that only become apparent after execution. By analyzing 12 instruction-tuned models ranging from 270M to 27B parameters, the study reveals that tool identity is linearly encoded in model activations—meaning the decision is baked into the model's internal representations before token generation begins. This represents a significant advance in AI interpretability and safety.
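To make "linearly readable" concrete, here is a minimal sketch of a linear probe: a logistic regression trained on the hidden activation of the last prompt token to predict which tool the model will call. The model name, layer index, and toy two-tool dataset are illustrative assumptions, not the paper's exact setup.

```python
# Sketch: probe whether tool choice is linearly readable from activations.
# Assumptions (not from the paper): model choice, layer index, toy two-tool data.
import numpy as np
import torch
from sklearn.linear_model import LogisticRegression
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "meta-llama/Llama-3.2-1B-Instruct"  # illustrative choice
LAYER = 12  # illustrative mid-to-late layer index

tok = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)
model.eval()

def last_token_activation(prompt: str) -> np.ndarray:
    """Hidden state of the final prompt token, before any tool token is generated."""
    inputs = tok(prompt, return_tensors="pt")
    with torch.no_grad():
        out = model(**inputs, output_hidden_states=True)
    return out.hidden_states[LAYER][0, -1].float().numpy()

# Toy prompts labeled with the tool the model should select.
prompts = [
    "What is 37 * 41?",                # calculator
    "Compute 19 squared.",             # calculator
    "What's the weather in Oslo?",     # weather_api
    "Will it rain in Lima tomorrow?",  # weather_api
]
labels = [0, 0, 1, 1]  # 0 = calculator, 1 = weather_api

X = np.stack([last_token_activation(p) for p in prompts])
probe = LogisticRegression(max_iter=1000).fit(X, labels)
print("probe accuracy (train, toy data):", probe.score(X, labels))
```

In real use the probe would be evaluated on a held-out split; the point of the sketch is that a single linear readout of one activation vector suffices to recover the tool decision.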

The technical findings are striking: adding activation vectors corresponding to tool differences switches model choices at near-perfect accuracy while maintaining schema compliance, and the effect concentrates in the output layer and in specific attention heads in mid-to-late layers. More importantly, the researchers identified a predictive signal: when multiple tools score similarly in the model's internal state, errors increase 14-21x, enabling detection of likely failures before a tool call executes.
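A hedged sketch of that steering idea, reusing the variables from the probe sketch above: build a difference-of-means vector between the two tools' activations and add it to the residual stream with a forward hook. The layer choice, scale, and Llama-style module path are assumptions for illustration; the paper's exact intervention may differ.

```python
# Sketch: steer tool choice by adding a "tool difference" vector to the
# residual stream at one layer. Reuses model, tok, X, labels, LAYER from the
# probe sketch above; scale and module path are illustrative assumptions.
import numpy as np
import torch

labels_arr = np.array(labels)
mu_calc = X[labels_arr == 0].mean(axis=0)     # mean activation, calculator prompts
mu_weather = X[labels_arr == 1].mean(axis=0)  # mean activation, weather prompts
steer = torch.tensor(mu_weather - mu_calc, dtype=torch.float32)  # calc -> weather

SCALE = 4.0  # illustrative strength; too large can break schema compliance

def add_steering(module, inputs, output):
    # Llama-style decoder layers return a tuple whose first element is the
    # hidden-state tensor; shift it along the steering direction.
    hidden = output[0] + SCALE * steer.to(output[0].device)
    return (hidden,) + output[1:]

# Hook the layer whose output feeds hidden_states[LAYER] (Llama-style tree).
handle = model.model.layers[LAYER - 1].register_forward_hook(add_steering)
try:
    inputs = tok("What is 37 * 41?", return_tensors="pt")
    out_ids = model.generate(**inputs, max_new_tokens=20)
    print(tok.decode(out_ids[0], skip_special_tokens=True))  # nudged toward weather
finally:
    handle.remove()
```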

The implications extend across AI development. For practitioners, this research provides mechanistic tools to both audit and correct agent behavior without retraining. For safety researchers, it demonstrates that tool-selection failures aren't mysterious black-box phenomena but interpretable patterns that open debugging pathways. The finding that base models already encode the correct tool (at 69-82% accuracy), with instruction tuning merely wiring that representation to the output, suggests these representations emerge naturally during pretraining.

Limitations matter: the research focuses on single-turn, fixed-menu settings rather than multi-turn agentic loops where tool selection becomes more complex. Broader deployment requires testing whether these steering and detection methods generalize across diverse tool domains and real-world agent loops. Future work should explore whether these linear representations hold as agents operate dynamically across conversation contexts.

Key Takeaways
  • Tool selection in language models is linearly readable from internal activations, enabling 77-100% accuracy steering without retraining.
  • A small set of mid-to-late-layer attention heads concentrate the causal effect for tool choice, suggesting interpretable loci for intervention.
  • Tool-selection errors can be predicted before execution when internal activation gaps between candidate tools are small (14-21x error increase); see the sketch after this list.
  • Base models encode correct tool identity before instruction tuning connects it to output, suggesting representations form during pretraining.
  • Current findings apply to single-turn fixed-menu settings; multi-turn agentic scenarios remain more fragile and require further investigation.
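Continuing the probe sketch above, here is a minimal illustration of that margin-based early-warning signal: score each candidate tool from the pre-generation activation and flag the call for review when the top two scores are close. The threshold is an uncalibrated placeholder, not a value from the paper.

```python
# Sketch: pre-execution error flagging from the internal score margin.
# Reuses probe and last_token_activation from the probe sketch above; the
# threshold is an uncalibrated, illustrative assumption.
import numpy as np

MARGIN_THRESHOLD = 1.0  # in practice, calibrate on held-out labeled data

def predict_or_flag(prompt: str):
    act = last_token_activation(prompt)
    s = float(probe.decision_function(act[None, :])[0])  # signed binary-probe score
    scores = np.array([-s, s])                           # [calculator, weather_api]
    ordered = np.sort(scores)
    margin = ordered[-1] - ordered[-2]                   # gap between top two tools
    choice = int(scores.argmax())
    flagged = margin < MARGIN_THRESHOLD                  # small gap -> elevated risk
    return choice, margin, flagged

choice, margin, flagged = predict_or_flag("Add 12 to the current temperature in Oslo.")
print(f"tool={choice} margin={margin:.2f} flag_for_review={flagged}")
```

An ambiguous prompt like the one above is exactly where the paper's 14-21x error inflation would be expected to show up as a small margin.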
Models mentioned: Llama (Meta)
Read Original → via arXiv – CS AI