Tool Calling is Linearly Readable and Steerable in Language Models
Researchers discovered that language models encode tool-selection decisions in interpretable linear patterns within their internal activations, enabling both prediction of errors before execution and steering of tool choices at 77-100% accuracy. This finding has implications for making AI agents more reliable and controllable, particularly in high-stakes scenarios where wrong tool selection causes irreversible failures.