Tool Calling is Linearly Readable and Steerable in Language Models
When a tool-calling agent picks the wrong tool, the failure remains invisible until execution: the email gets sent, the meeting gets missed. Probing 12 instruction-tuned models across Gemma 3, Qwen 3, Qwen 2.5, and Llama 3.1 (270M to 27B parameters), we show that the identity of the chosen tool is linearly readable and steerable within the model's hidden states. By adding a vector derived from the mean difference of internal activations between two tools, we can switch the model's selection with 77-100% accuracy on single-turn prompts using tool names only (93-100% for models with 4B+ parameters). Autoregressively generated JSON arguments subsequently align with the new tool's schema, enabling precise, linear control over tool-calling behavior without fine-tuning.
Background and Context
The integration of tool-calling capabilities into large language models has transformed these systems from passive text generators into active agents capable of executing complex workflows. This architecture carries a persistent vulnerability, however: when an agent selects the wrong tool, the error remains invisible until the point of execution. By then the consequences may be irreversible, such as an email sent to the wrong recipient or a critical meeting missed. Because the decision is made inside the model's hidden states, it has been difficult to diagnose or prevent these missteps before they occur. The research summarized here addresses this opacity by probing those internal representations to understand how tool selection is encoded.
The study covers a diverse set of 12 instruction-tuned models spanning the Gemma 3, Qwen 3, Qwen 2.5, and Llama 3.1 families. These models cover a wide range of scales, from 270 million to 27 billion parameters, allowing a robust analysis of how model size influences the linear readout and steerability of tool identity. By examining these specific architectures, the research aims to determine whether the choice of a tool is encoded in a way that is both interpretable and modifiable through linear interventions in the hidden states. This question matters for developing reliable AI agents that can be trusted in high-stakes environments where execution errors are costly.
Deep Analysis
The core finding of the research is that the identity of the chosen tool is linearly readable and steerable within the model's hidden states. In other words, the representation of a specific tool is not scattered arbitrarily across the network; it lies along a direction in activation space that can be identified and manipulated. To demonstrate this, the researchers collected internal activations from single-turn tool-calling prompts and computed the mean difference in activations between prompts that elicit two different tools. Adding a vector derived from this mean difference to the model's hidden state was enough to change which tool the model selects.
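As a concrete illustration, the difference-of-means construction can be prototyped in a few lines with PyTorch and Hugging Face transformers. This is a minimal sketch, not the paper's code: the model choice, intervention layer, steering scale, tool names, and prompt templates below are all illustrative assumptions.

```python
# Sketch: difference-of-means steering for tool selection.
# Hypothetical choices (not from the paper): the model, the layer index,
# the steering scale, and the prompt/tool templates below.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "Qwen/Qwen2.5-7B-Instruct"  # any instruction-tuned chat model
LAYER = 20                          # decoder layer to intervene on (illustrative)
SCALE = 4.0                         # steering strength (illustrative)

tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL, torch_dtype=torch.bfloat16)
model.eval()

def last_token_state(prompt: str) -> torch.Tensor:
    """Residual-stream state of the final prompt token after layer LAYER."""
    ids = tok(prompt, return_tensors="pt").input_ids
    with torch.no_grad():
        out = model(ids, output_hidden_states=True)
    # hidden_states[0] is the embedding output, so index LAYER + 1
    # is the output of decoder layer LAYER.
    return out.hidden_states[LAYER + 1][0, -1]

requests = ["notify Bob about the launch", "follow up on the Q3 report"]
mu_email = torch.stack(
    [last_token_state(f"Use the send_email tool. Task: {r}") for r in requests]
).mean(0)
mu_meeting = torch.stack(
    [last_token_state(f"Use the schedule_meeting tool. Task: {r}") for r in requests]
).mean(0)
steer = mu_meeting - mu_email  # direction from send_email toward schedule_meeting

def steering_hook(module, inputs, output):
    """Add the steering vector to the residual stream at every position."""
    hidden = output[0] if isinstance(output, tuple) else output
    hidden = hidden + SCALE * steer.to(hidden.dtype)
    return (hidden,) + output[1:] if isinstance(output, tuple) else hidden

handle = model.model.layers[LAYER].register_forward_hook(steering_hook)
ids = tok("Use the send_email tool. Task: notify Bob about the launch",
          return_tensors="pt").input_ids
print(tok.decode(model.generate(ids, max_new_tokens=60)[0],
                 skip_special_tokens=True))
handle.remove()
```

The hook adds the steering vector at every token position during generation; where exactly to intervene (which layer, which positions, what scale) is a tuning choice that the sketch leaves fixed for simplicity.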
The effectiveness of this linear steering mechanism is remarkably high. On single-turn prompts containing only tool names, the intervention switched the model's selection with 77-100% accuracy; for models with 4 billion parameters or more, the range tightens to 93-100%. This indicates that larger models encode tool identities more distinctly and robustly, making them more amenable to linear control. The ability to switch the selected tool with such high precision suggests that the decision boundary between tools is linearly separable in the model's representation space.
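Switch accuracy of this kind can be measured by generating with the steering hook installed and checking which tool name appears in the output. A sketch of that bookkeeping follows, assuming a `generate_with_steering` helper that wraps the hooked `generate` call above; the name-matching readout is deliberately crude and illustrative.

```python
# Sketch: switch-accuracy measurement. `generate_with_steering` is assumed
# to run generation with the steering hook installed, as in the block above.
def called_tool(text: str, tool_names: list[str]) -> str | None:
    """Crude readout: the tool name that appears earliest in the output."""
    hits = [(text.find(name), name) for name in tool_names if name in text]
    return min(hits)[1] if hits else None

def switch_accuracy(prompts, source_tool, target_tool, generate_with_steering):
    """Fraction of prompts where steering flips source_tool -> target_tool."""
    flipped = sum(
        called_tool(generate_with_steering(p), [source_tool, target_tool])
        == target_tool
        for p in prompts
    )
    return flipped / len(prompts)
```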
Furthermore, the steering effect extends beyond the tool name itself. The JSON arguments generated autoregressively after the steered selection align with the schema of the new tool, implying that the linear intervention influences the subsequent generation of parameters as well, keeping them consistent with the new tool's requirements. This end-to-end control over tool-calling behavior, achieved without fine-tuning, provides a mechanism for correcting errors or guiding agent behavior in real time.
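The schema-consistency claim is straightforward to test mechanically: parse the arguments out of the steered generation and validate them against the new tool's JSON Schema. Here is a sketch using the `jsonschema` package; the example schema and the extraction regex are illustrative assumptions, not artifacts of the paper.

```python
# Sketch: do the steered arguments satisfy the new tool's schema?
# The schema below is a made-up example for schedule_meeting.
import json
import re
from jsonschema import ValidationError, validate  # pip install jsonschema

SCHEDULE_MEETING_SCHEMA = {
    "type": "object",
    "properties": {
        "title": {"type": "string"},
        "start_time": {"type": "string"},
        "attendees": {"type": "array", "items": {"type": "string"}},
    },
    "required": ["title", "start_time"],
    "additionalProperties": False,
}

def args_match_schema(generated: str, schema: dict) -> bool:
    """Pull the first JSON object out of the output and validate it."""
    match = re.search(r"\{.*\}", generated, flags=re.DOTALL)
    if match is None:
        return False
    try:
        validate(instance=json.loads(match.group(0)), schema=schema)
        return True
    except (json.JSONDecodeError, ValidationError):
        return False

print(args_match_schema(
    '{"title": "Launch sync", "start_time": "2024-05-01T10:00"}',
    SCHEDULE_MEETING_SCHEMA,
))  # True
```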
Industry Impact
The ability to linearly read and steer tool-calling behavior has significant implications for the reliability and safety of AI agents. Currently, debugging tool-calling errors often requires extensive logging and post-hoc analysis. With linear steerability, developers can implement real-time monitoring and correction mechanisms. If an agent is detected to be heading toward a suboptimal or incorrect tool selection, a linear intervention can redirect it before execution. This reduces the risk of operational failures and enhances the trustworthiness of AI systems in production environments.
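To make the monitoring idea concrete: the same linear readability that enables steering supports a cheap linear probe that predicts the tool about to be called from a single hidden state, flagging confident mismatches before execution. The sketch below uses scikit-learn with placeholder arrays standing in for cached last-token activations (the real features would come from an extraction step like the first code block); the hidden size and confidence threshold are illustrative.

```python
# Sketch: a linear probe as a pre-execution monitor. The random arrays are
# placeholders for cached last-token activations and their tool labels.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4096)).astype(np.float32)  # (prompts, hidden_dim)
y = rng.integers(0, 2, size=200)  # 0 = send_email, 1 = schedule_meeting

probe = LogisticRegression(max_iter=1000).fit(X, y)

def flag_mismatch(hidden: np.ndarray, expected_tool: int,
                  threshold: float = 0.9) -> bool:
    """True when the probe confidently predicts a tool other than expected."""
    proba = probe.predict_proba(hidden.reshape(1, -1))[0]
    predicted = int(proba.argmax())
    return predicted != expected_tool and proba[predicted] >= threshold
```

If the probe fires, the steering vector from earlier could redirect the call before any side effect occurs; combining the two is the real-time correction loop the paragraph above describes.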
This technique also opens new avenues for improving the efficiency of AI agents. By steering the model toward more appropriate tools, agents can reduce the number of incorrect attempts and iterations required to complete a task. This is particularly important in scenarios where API calls are costly or rate-limited. The linear control mechanism allows for precise adjustments without the computational overhead of retraining or fine-tuning the model, making it a scalable solution for improving agent performance.
Additionally, the findings contribute to the broader field of mechanistic interpretability. By demonstrating that tool identity is linearly readable, the research provides a concrete example of how complex behaviors in large language models can be understood and manipulated through linear algebraic operations. This advances our understanding of how language models represent and process information, paving the way for more interpretable and controllable AI systems.
Outlook
Looking ahead, the ability to steer tool-calling behavior linearly is likely to become a standard feature in the development of robust AI agents. As the industry moves towards more autonomous and complex agent workflows, the need for reliable error correction and real-time control will become increasingly critical. The techniques demonstrated in this research provide a foundation for building agents that can self-correct and adapt to changing conditions without human intervention.
Future research may explore extending this linear steering mechanism to other aspects of agent behavior, such as reasoning steps or multi-turn dialogue management. Additionally, investigating the limits of this approach in more complex and noisy environments will be important for ensuring its robustness. As models continue to grow in size and capability, the linear structure of their internal representations may become even more pronounced, offering new opportunities for control and interpretability.
The implications for the AI industry are profound. By enabling precise control over tool-calling behavior, this research helps bridge the gap between theoretical capabilities and practical reliability. It suggests a future where AI agents are not only powerful but also predictable and safe, capable of operating in dynamic environments with minimal risk of error. This shift towards more controllable and interpretable AI systems will be essential for the widespread adoption of autonomous agents in critical industries.