#tool-calling News & Analysis

22 articles tagged with #tool-calling. AI-curated summaries with sentiment analysis and key takeaways from 50+ sources.

AI · Bullish · arXiv – CS AI · Jun 19 · 7/10

LedgerAgent: Structured State for Policy-Adherent Tool-Calling Agents

LedgerAgent is a new inference-time method that improves how AI agents handle customer-service tasks by maintaining explicit task states in a separate ledger rather than reconstructing context from prompts. The approach reduces policy violations and improves decision consistency across multiple trials by validating state-dependent constraints before executing tool calls.
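The idea of checking state-dependent constraints against an explicit ledger before a tool call runs can be illustrated with a minimal sketch. This is not the paper's implementation: the `Ledger` class, the `guarded_call` helper, the `issue_refund` tool, and the identity-verification rule are all hypothetical names chosen for illustration.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Optional


@dataclass
class Ledger:
    """Explicit task state kept outside the prompt (hypothetical structure)."""
    facts: dict = field(default_factory=dict)


# A constraint returns an error message when violated, or None when satisfied.
Constraint = Callable[[Ledger, dict], Optional[str]]


def refund_requires_verified_identity(ledger: Ledger, args: dict) -> Optional[str]:
    # Hypothetical policy rule: refunds need a verified identity on the ledger.
    if not ledger.facts.get("identity_verified", False):
        return "refund blocked: identity not verified"
    return None


# Constraints are looked up by tool name (assumed mapping, for illustration).
CONSTRAINTS = {"issue_refund": [refund_requires_verified_identity]}


def guarded_call(ledger: Ledger, tool_name: str, args: dict, tools: dict) -> dict:
    """Validate every constraint for the tool first; execute only if all pass."""
    for check in CONSTRAINTS.get(tool_name, []):
        error = check(ledger, args)
        if error is not None:
            return {"ok": False, "error": error}
    result: Any = tools[tool_name](**args)
    # Record the outcome so later constraints can depend on it.
    ledger.facts[f"last_{tool_name}"] = result
    return {"ok": True, "result": result}
```

In this sketch a refund call is rejected until `identity_verified` is set on the ledger, so the policy check does not depend on the model re-deriving that fact from the conversation.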

AI · Bullish · arXiv – CS AI · Jun 10 · 7/10

ASA: Backbone-Training-Free Representation Engineering for Tool-Calling Agents

Researchers introduce Activation Steering Adapter (ASA), a training-free method that improves LLM tool-calling reliability by intervening on mid-layer activations at inference time. The approach achieves significant performance gains on tool-use benchmarks without parameter updates, addressing a critical gap between what models internally represent and their actual behavior.

AI · Bullish · arXiv – CS AI · Jun 2 · 7/10

Self-Healing Agentic Orchestrators for Reliable Tool-Augmented Large Language Model Systems

Researchers present a self-healing orchestration framework for tool-augmented large language models that treats reliability as a bounded runtime control problem, achieving 98.8% task success by mapping failure signals to recovery actions and verifying results. The approach outperforms retry-only and full-replanning baselines across multiple benchmarks, particularly excelling when recovery budgets are constrained.
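Mapping failure signals to recovery actions under a fixed recovery budget, then verifying the result, might look roughly like the sketch below. It is an illustration under assumptions, not the framework from the paper: `ToolFailure`, `run_with_recovery`, and the signal names are hypothetical.

```python
from typing import Callable, Optional


class ToolFailure(Exception):
    """Raised by a tool with a short failure signal such as 'timeout'."""

    def __init__(self, signal: str):
        super().__init__(signal)
        self.signal = signal


def run_with_recovery(
    call: Callable[[dict], dict],
    args: dict,
    recoveries: dict,
    verify: Callable[[dict], bool],
    budget: int,
) -> Optional[dict]:
    """Run `call`, applying at most `budget` recovery actions.

    `recoveries` maps a failure signal to a function that returns repaired
    arguments. Returns the verified result, or None if recovery fails.
    """
    used = 0
    while True:
        try:
            result = call(args)
            # A result that fails verification is treated as its own signal.
            signal = None if verify(result) else "verification_failed"
        except ToolFailure as exc:
            result, signal = None, exc.signal
        if signal is None:
            return result
        recover = recoveries.get(signal)
        if recover is None or used >= budget:
            return None  # no known recovery, or budget exhausted
        args = recover(args)
        used += 1
```

A caller could, for example, register a `"timeout"` recovery that doubles a timeout argument; the budget then caps how many such repairs are attempted before giving up.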

AI · Neutral · arXiv – CS AI · Jun 2 · 7/10

On Effectiveness and Efficiency of Agentic Tool-calling and RL Training

A new research paper identifies critical inconsistencies in how tool-calling capabilities are evaluated across LLM agents, showing that minor implementation choices significantly affect benchmark results. The authors propose two optimization techniques that accelerate reinforcement learning-based tool-calling training while maintaining performance levels.

AI · Bullish · arXiv – CS AI · Jun 1 · 7/10

MAVEN: Improving Generalization in Agentic Tool Calling

Researchers introduce MAVEN, a symbolic reasoning framework that improves language model generalization in tool-calling tasks by 23 percentage points (from 48% to 71% accuracy) on a new stress-test benchmark, at roughly one-tenth the cost of frontier proprietary models. The work demonstrates that lightweight verification-centered scaffolds can enhance compositional reasoning without additional model training.

AI · Bearish · arXiv – CS AI · Jun 1 · 7/10

Depth-Dependent Indirect Prompt Injection in Tool-Calling ReAct Agents: Injection Depth, Payload Framing, and Turn-Budget Sensitivity

Researchers identified that indirect prompt injection attacks against ReAct AI agents succeed at dramatically different rates depending on where malicious payloads appear in tool sequences, with success rates dropping from 60% at the first tool observation to 0% at deeper positions. The study reveals that payload framing and conversation turn limits have minimal impact on attack success, making injection depth the critical vulnerability factor for AI agent systems handling real-world tasks.

🧠 GPT-4 · 🧠 Claude

AI · Bullish · arXiv – CS AI · May 29 · 7/10

ParaTool: Shifting Tool Representations from Context to Parameters

ParaTool is a new framework that shifts tool representations from context to parameters in large language models, enabling efficient tool calling without relying on lengthy in-context documentation. The approach uses parametric tool pre-training, soft tool selection, and fine-tuning to reduce inference overhead and hallucination risks while maintaining superior performance on benchmark tests.

AI · Bearish · arXiv – CS AI · May 29 · 7/10

How Consistent Are LLM Agents? Measuring Behavioral Reproducibility in Multi-Step Tool-Calling Pipelines

Researchers present an empirical study examining whether Large Language Model agents with tool-calling capabilities produce consistent outputs when given identical inputs across multiple invocations. The study expands beyond prior ReAct-style research to measure behavioral reproducibility in structured tool-calling interfaces, revealing a fundamental reliability gap that could impact production deployment of LLM agents.

AI · Bullish · arXiv – CS AI · May 11 · 7/10

Switchcraft: AI Model Router for Agentic Tool Calling

Switchcraft is a new AI model router designed specifically for agentic tool calling that selects the lowest-cost model while maintaining correctness. The system achieves 82.9% accuracy, matching top models, while reducing inference costs by 84%, demonstrating that larger models don't consistently outperform smaller ones on function-calling tasks.
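Selecting the lowest-cost model subject to a correctness requirement can be sketched as a simple rule. This is a hypothetical illustration, not Switchcraft's actual routing logic: the `Candidate` fields, the `predicted_correct` score (assumed to come from some upstream predictor), and the `min_correct` threshold are all assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    name: str
    cost_per_call: float      # assumed cost units, for illustration only
    predicted_correct: float  # assumed estimate in [0, 1] from a predictor


def route(candidates: list, min_correct: float = 0.9) -> Candidate:
    """Pick the cheapest candidate predicted to be correct enough.

    If no candidate clears the threshold, fall back to the one with the
    highest predicted correctness.
    """
    if not candidates:
        raise ValueError("at least one candidate model is required")
    eligible = [c for c in candidates if c.predicted_correct >= min_correct]
    if eligible:
        return min(eligible, key=lambda c: c.cost_per_call)
    return max(candidates, key=lambda c: c.predicted_correct)
```

Under this rule a small model wins whenever it is predicted to clear the threshold, which is one way the cost savings described above could arise when larger models are not reliably better.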

AI · Bullish · arXiv – CS AI · May 11 · 7/10

Tool Calling is Linearly Readable and Steerable in Language Models

Researchers discovered that language models encode tool-selection decisions in interpretable linear patterns within their internal activations, enabling both prediction of errors before execution and steering of tool choices at 77-100% accuracy. This finding has implications for making AI agents more reliable and controllable, particularly in high-stakes scenarios where wrong tool selection causes irreversible failures.

🧠 Llama

AI · Bullish · arXiv – CS AI · May 4 · 7/10

To Call or Not to Call: A Framework to Assess and Optimize LLM Tool Calling

Researchers present a decision-making framework to optimize when large language models should call external tools like web search. The study reveals that models often misjudge their actual need for tool use, and proposes lightweight estimators trained on hidden states to improve tool-calling decisions, demonstrating performance gains across multiple tasks.

AI · Bullish · arXiv – CS AI · Apr 20 · 7/10

PolicyBank: Evolving Policy Understanding for LLM Agents

Researchers introduce PolicyBank, a memory mechanism that allows LLM agents to autonomously refine their understanding of organizational policies through iterative feedback and testing, rather than treating policies as immutable rules. The system addresses a critical AI alignment challenge where natural-language policy specifications contain ambiguities and gaps that cause agent behavior to diverge from intended requirements, achieving up to 82% closure of specification gaps compared to near-zero success with existing memory mechanisms.

AI · Bullish · arXiv – CS AI · Apr 14 · 7/10

Disambiguation-Centric Finetuning Makes Enterprise Tool-Calling LLMs More Realistic and Less Risky

Researchers introduce DiaFORGE, a three-stage framework for training LLMs to reliably invoke enterprise APIs by focusing on disambiguation between similar tools and underspecified arguments. Fine-tuned models achieved 27-49 percentage points higher tool-invocation success than GPT-4o and Claude-3.5-Sonnet, with an open corpus of 5,000 production-grade API specifications released for further research.

🧠 GPT-4 · 🧠 Claude

AI · Bearish · arXiv – CS AI · Apr 10 · 7/10

TraceSafe: A Systematic Assessment of LLM Guardrails on Multi-Step Tool-Calling Trajectories

Researchers introduce TraceSafe-Bench, a benchmark evaluating how well LLM guardrails detect safety risks across multi-step tool-using trajectories. The study reveals that guardrail effectiveness depends more on structural reasoning capabilities than semantic safety training, and that general-purpose LLMs outperform specialized safety models in detecting mid-execution vulnerabilities.

AI · Neutral · arXiv – CS AI · Apr 7 · 7/10

Causality Laundering: Denial-Feedback Leakage in Tool-Calling LLM Agents

Researchers have identified a new security vulnerability called 'causality laundering' in AI tool-calling systems, where attackers can extract private information by learning from system denials and using that knowledge in subsequent tool calls. They developed the Agentic Reference Monitor (ARM) system to detect and prevent these attacks through enhanced provenance tracking.

AI · Neutral · arXiv – CS AI · Jun 23 · 6/10

Self-Evolution for Multi-Turn Tool-Calling Agents via Divergence-Point Preference Learning

Researchers present ToolGraph, a framework that improves multi-turn tool-using AI agents through self-evolution via preference learning. By combining schema-derived topology with divergence-point preference optimization, the system achieves a 16.8% improvement over baseline performance on benchmark tasks, with gains concentrated in airline and retail domains.

AI · Neutral · arXiv – CS AI · Jun 10 · 6/10

T1-Bench: Benchmarking Multi-Scenario Agents in Real-World Domains

Researchers introduce T1-Bench, a comprehensive benchmark for evaluating large language model-based agents across 25 domains with multi-step, multi-domain tasks that better reflect real-world complexity than existing benchmarks. The framework tests 12 models on structured reasoning, tool utilization, and conversational quality, with both automated and human evaluation methods.

AI · Neutral · arXiv – CS AI · May 28 · 6/10

AsyncTool: Evaluating the Asynchronous Function Calling Capability under Multi-Task Scenarios

Researchers introduce AsyncTool, a benchmark for evaluating how well LLM-based agents handle multiple concurrent tasks with realistic tool response delays. The study reveals that current AI agents struggle significantly with asynchronous multitasking, experiencing substantial performance degradation when tool feedback is delayed, highlighting a critical gap in real-world applicability.

AI · Neutral · arXiv – CS AI · May 11 · 6/10

MIST: Multimodal Interactive Speech-based Tool-calling Conversational Assistants for Smart Homes

Researchers introduce MIST, a synthetic dataset and framework for training voice-based AI assistants to control IoT devices in smart homes. The work reveals significant performance gaps between open and closed-weight multimodal LLMs on complex, real-world smart home tasks requiring spatiotemporal reasoning and mixed-initiative interaction.

AI · Bullish · arXiv – CS AI · May 11 · 6/10

WebClipper: Efficient Evolution of Web Agents with Graph-based Trajectory Pruning

WebClipper is a new framework that optimizes web agent trajectories by pruning redundant reasoning steps through graph-based analysis, reducing tool-call rounds by approximately 20% while maintaining or improving accuracy. The approach models agent search processes as directed acyclic graphs and introduces an F-AE Score metric to measure the balance between accuracy and efficiency in web agent design.

AI · Neutral · arXiv – CS AI · May 9 · 6/10

Inference-Time Budget Control for LLM Search Agents

Researchers propose a two-stage inference-time budget control system for LLM search agents that optimizes how language models allocate computational resources between tool calls and token generation during multi-hop question answering. The method uses Value-of-Information scoring to decide when to retrieve information, decompose questions, or commit to final answers, demonstrating consistent performance gains across multiple benchmarks and model sizes.

AI · Neutral · arXiv – CS AI · Apr 14 · 6/10

FinTrace: Holistic Trajectory-Level Evaluation of LLM Tool Calling for Long-Horizon Financial Tasks

Researchers introduced FinTrace, a benchmark dataset with 800 expert-annotated trajectories for evaluating how large language models perform financial tool-calling tasks. The study reveals that while frontier LLMs excel at selecting appropriate tools, they struggle significantly with information utilization and generating accurate final outputs, pointing to a critical reasoning gap that persists even after fine-tuning with preference optimization techniques.