#tool-use News & Analysis

65 articles tagged with #tool-use. AI-curated summaries with sentiment analysis and key takeaways from 50+ sources.

AI · Bullish · arXiv – CS AI · Apr 20 · 7/10

AgentV-RL: Scaling Reward Modeling with Agentic Verifier

Researchers introduce AgentV-RL, an agentic verifier framework that enhances reward modeling for large language models by combining bidirectional reasoning agents with tool-use capabilities. The system addresses critical limitations in LLM verification by enabling forward and backward tracing of solutions, achieving 25.2% performance gains over existing methods and positioning agentic reward modeling as a promising new paradigm.

AI · Neutral · arXiv – CS AI · Apr 14 · 7/10

The Amazing Agent Race: Strong Tool Users, Weak Navigators

Researchers introduce The Amazing Agent Race (AAR), a new benchmark revealing that LLM agents excel at tool-use but struggle with navigation tasks. Testing three agent frameworks on 1,400 complex, graph-structured puzzles shows the best achieve only 37.2% accuracy, with navigation errors (27-52% of failures) far outweighing tool-use failures (below 17%), exposing a critical blind spot in existing linear benchmarks.

🧠 Claude

AI · Bullish · arXiv – CS AI · Apr 14 · 7/10

UniToolCall: Unifying Tool-Use Representation, Data, and Evaluation for LLM Agents

UniToolCall introduces a standardized framework unifying tool-use representation, training data, and evaluation for LLM agents. The framework combines 22k+ tools and 390k+ training instances with a unified evaluation methodology, enabling fine-tuned models like Qwen3-8B to achieve 93% precision, surpassing GPT, Gemini, and Claude on specific benchmarks.

🧠 Claude · 🧠 Gemini

AI · Neutral · arXiv – CS AI · Apr 10 · 7/10

Benchmarking LLM Tool-Use in the Wild

Researchers introduce WildToolBench, a new benchmark for evaluating large language models' ability to use tools in real-world scenarios. Testing 57 LLMs reveals that none exceed 15% accuracy, exposing significant gaps in current models' agentic capabilities when facing messy, multi-turn user interactions rather than simplified synthetic tasks.

AI · Bullish · arXiv – CS AI · Apr 6 · 7/10

Training Multi-Image Vision Agents via End2End Reinforcement Learning

Researchers introduce IMAgent, an open-source visual AI agent trained with reinforcement learning to handle multi-image reasoning tasks. The system addresses limitations of current VLM-based agents that only process single images, using specialized tools for visual reflection and verification to maintain attention on image content throughout inference.

🏢 OpenAI · 🧠 o1 · 🧠 o3

AI · Neutral · arXiv – CS AI · Mar 17 · 7/10

CCTU: A Benchmark for Tool Use under Complex Constraints

Researchers introduce CCTU, a new benchmark for evaluating large language models' ability to use tools under complex constraints. The study reveals that even state-of-the-art LLMs achieve less than 20% task completion rates when strict constraint adherence is required, with models violating constraints in over 50% of cases.

AI · Bullish · arXiv – CS AI · Mar 17 · 7/10

AutoTool: Automatic Scaling of Tool-Use Capabilities in RL via Decoupled Entropy Constraints

Researchers introduce AutoTool, a new reinforcement learning approach that enables AI agents to automatically scale their reasoning capabilities for tool use. The method uses entropy-based optimization and supervised fine-tuning to help models efficiently determine appropriate thinking lengths for simple versus complex problems, achieving 9.8% accuracy improvements while reducing computational overhead by 81%.

AI · Bullish · arXiv – CS AI · Mar 11 · 7/10

AlphaApollo: A System for Deep Agentic Reasoning

AlphaApollo is a new AI reasoning system that addresses limitations in foundation models through multi-turn agentic reasoning, learning, and evolution components. The system demonstrates significant performance improvements across math reasoning benchmarks, with success rates exceeding 85% for tool calls and substantial gains from reinforcement learning across different model scales.

AI · Neutral · arXiv – CS AI · Feb 27 · 7/10 · 3

The Tool Decathlon: Benchmarking Language Agents for Diverse, Realistic, and Long-Horizon Task Execution

Researchers introduce Tool Decathlon (Toolathlon), a comprehensive benchmark for evaluating AI language agents across 32 software applications and 604 tools in realistic, multi-step scenarios. The benchmark reveals significant limitations in current AI models: the best performer (Claude-4.5-Sonnet) reaches a success rate of only 38.6% on complex, real-world tasks.

AI · Bullish · arXiv – CS AI · Feb 27 · 7/10 · 7

OmniGAIA: Towards Native Omni-Modal AI Agents

Researchers introduce OmniGAIA, a comprehensive benchmark for evaluating omni-modal AI agents that can process video, audio, and image data simultaneously with complex reasoning capabilities. They also propose OmniAtlas, a foundation agent that enhances existing open-source models' ability to use tools across multiple modalities, marking progress toward more capable AI assistants.

AI · Neutral · arXiv – CS AI · Jun 23 · 6/10

From Knowing to Acting: Benchmarking Self-Awareness Capability of LLM Agents

Researchers introduce KAPRO, a framework for evaluating whether LLM agents can accurately determine when to use external tools versus relying on internal knowledge. The study reveals that open-source models suffer from tool overuse due to pattern matching, while proprietary models show better self-awareness, highlighting a critical gap in current AI agent capabilities.

AI · Neutral · arXiv – CS AI · Jun 23 · 6/10

PlanBench-XL: Evaluating Long-Horizon Planning of LLM Tool-Use Agents in Large-Scale Tool Ecosystems

Researchers introduced PlanBench-XL, a benchmark testing how LLM agents plan and execute tasks across 1,665 tools in realistic scenarios. The study reveals significant vulnerabilities in current AI systems, with performance dropping from 51.9% to 11.36% accuracy when tools fail or behave unexpectedly, exposing critical gaps in adaptive planning capabilities.

🧠 GPT-5

AI · Neutral · arXiv – CS AI · Jun 19 · 6/10

MENTOR: Reinforcement Learning via Flexible Teacher-Optimized Rewards for Tool-Use Distillation

Researchers propose MENTOR, a reinforcement learning framework that improves how small language models learn tool-use capabilities from larger models by using flexible, process-aware rewards instead of rigid trajectory replication. The approach demonstrates better out-of-domain generalization than supervised fine-tuning and strict RL baselines in executable-tool environments.

AI · Neutral · arXiv – CS AI · Jun 11 · 6/10

APPO: Agentic Procedural Policy Optimization

Researchers propose Agentic Procedural Policy Optimization (APPO), a new reinforcement learning method that improves how AI agents learn to use tools by identifying fine-grained decision points rather than relying on coarse tool-call boundaries. The approach improves results by roughly 4 points across 13 benchmarks while maintaining efficiency and interpretability.

AI · Neutral · arXiv – CS AI · Jun 9 · 6/10

SecureClaw: Clawing Back Control of LLM Agents

SecureClaw introduces a dual-boundary security architecture designed to protect LLM agents from both unauthorized external actions and sensitive data exposure. The system uses opaque handles and a PREVIEW→COMMIT protocol to prevent language models from directly accessing secrets or executing unreviewed side effects, achieving zero attack success rates on major security benchmarks.
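
The two mechanisms named in this summary, opaque handles and a PREVIEW→COMMIT step, can be illustrated with a minimal sketch. It is not SecureClaw's implementation: the class names `HandleVault` and `ActionGate`, the ticket scheme, and the `send_request` tool are all hypothetical, chosen only to show the general idea that the model sees handles rather than secrets and that side effects run only after an explicit approval.

```python
import secrets
from dataclasses import dataclass, field


@dataclass
class HandleVault:
    """Hypothetical store: maps opaque handles to secret values."""
    _store: dict = field(default_factory=dict)

    def put(self, value: str) -> str:
        # The model is given only this random handle, never the value.
        handle = f"h_{secrets.token_hex(8)}"
        self._store[handle] = value
        return handle

    def resolve(self, handle: str) -> str:
        return self._store[handle]


@dataclass
class ActionGate:
    """Hypothetical two-phase gate: preview() stages, commit() executes."""
    vault: HandleVault
    _pending: dict = field(default_factory=dict)

    def preview(self, tool, handle_args: dict) -> tuple[str, str]:
        # Stage the action and describe it using handles only.
        ticket = secrets.token_hex(8)
        self._pending[ticket] = (tool, handle_args)
        description = f"{tool.__name__}({handle_args})"
        return ticket, description

    def commit(self, ticket: str, approved: bool):
        # Each ticket can be used once; rejected actions never run.
        tool, handle_args = self._pending.pop(ticket)
        if not approved:
            return None
        real_args = {k: self.vault.resolve(v) for k, v in handle_args.items()}
        return tool(**real_args)


def send_request(api_key: str) -> str:
    """Stand-in for a side-effecting tool (assumption for this example)."""
    return f"sent with key of length {len(api_key)}"


vault = HandleVault()
gate = ActionGate(vault)
handle = vault.put("sk-example-secret")

ticket, description = gate.preview(send_request, {"api_key": handle})
result = gate.commit(ticket, approved=True)

rejected_ticket, _ = gate.preview(send_request, {"api_key": handle})
rejected = gate.commit(rejected_ticket, approved=False)
```

In this sketch the secret is resolved only inside `commit`, so the text returned by `preview` can be shown to a model or a reviewer without exposing it.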

AI · Neutral · arXiv – CS AI · Jun 8 · 6/10

Declarative Skills for AI Agents in Knowledge-Grounded Tool-Use Workflows

Researchers compare three orchestration approaches for AI agents handling customer-service workflows: declarative agents using natural-language skill files, imperative agents with programmatic state machines, and unscaffolded baseline agents. The study finds that retrieval quality is the dominant bottleneck, and declarative skills improve performance on procedural tasks only when evidence quality is high.

AI · Bullish · arXiv – CS AI · Jun 8 · 6/10

Exploring Agentic Tool-Calling Decisions via Uncertainty-Aligned Reinforcement Learning

Researchers propose TRUST, a reinforcement learning framework that improves LLM-based agent decision-making by incorporating uncertainty quantification into reward design. The approach addresses a critical flaw where standard RL weakens the distinction between correct and incorrect tool-use decisions, leading to overconfident mistakes and reduced exploration capabilities.

AI · Neutral · arXiv – CS AI · Jun 5 · 6/10

TAPO: Tool-Aware Policy Optimization via Credit Transfer for Multimodal Search Agents

Researchers propose TAPO (Tool-Aware Policy Optimization), a method that fixes credit misassignment problems in reinforcement learning for multimodal search agents. The technique improves training efficiency for AI systems that use tools, delivering consistent improvements across multiple benchmarks without requiring additional annotations or computational overhead.

AI · Bullish · Hugging Face Blog · Jun 4 · 6/10

EVA-Bench Data 2.0: 3 Domains, 121 Tools, 213 Scenarios

EVA-Bench Data 2.0 expands evaluation capabilities across 3 domains with 121 tools and 213 scenarios, providing a comprehensive benchmarking framework for assessing AI agent performance. This release represents a significant advancement in standardized testing infrastructure for AI systems, enabling more rigorous evaluation of tool-use capabilities across diverse operational contexts.

AI · Neutral · arXiv – CS AI · Jun 4 · 6/10

VAMPS: Visual-Assisted Mathematical Problem Solving Benchmark

Researchers introduced VAMPS, a benchmark dataset of 1,168 mathematical problems designed to test whether multimodal AI models can effectively use visualization tools to solve complex algebra and calculus problems. Surprisingly, the study found that direct analytical solving consistently outperformed graph-assisted approaches across multiple models, even when visualization should theoretically help.

AI · Neutral · arXiv – CS AI · Jun 2 · 6/10

Characterization of Multi-Model Agentic AI Systems on General Tasks via Trace-Driven Simulation

Researchers introduced GAIATrace, a token-level trace dataset documenting how state-of-the-art agentic AI systems (MiroThinker and OWL) execute general tasks, alongside Vidur-Agent, a simulator enabling reproducible system evaluation. This work addresses the black-box nature of agentic AI by providing unprecedented visibility into reasoning processes and system-level behavior.

AI · Neutral · arXiv – CS AI · Jun 2 · 6/10

Diversity Over Frequency: Rethinking Tool Use in Visual Chain-of-Thought Agents

Researchers discover that visual reasoning agents exhibit a 'tool-use collapse' phenomenon where models progressively abandon external visual tools while maintaining or improving task accuracy. By introducing entropy regularization to encourage diverse exploration rather than optimizing tool frequency, the team achieves superior performance on complex tasks like 3D spatial reasoning and medical visual question answering, suggesting diversity matters more than tool usage frequency.
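
To make the contrast between tool diversity and tool frequency concrete, here is a minimal sketch that measures the Shannon entropy of the tools used in one trajectory and adds it to a task reward as a bonus. The summary does not specify how the paper's regularizer is defined, so this reward-shaping form, the function names, and the `beta` weight are assumptions for illustration only.

```python
import math
from collections import Counter


def tool_entropy(tool_calls):
    """Shannon entropy (in nats) of the empirical tool-usage distribution."""
    counts = Counter(tool_calls)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log(c / total) for c in counts.values())


def regularized_reward(task_reward, tool_calls, beta=0.1):
    """Task reward plus a diversity bonus; beta is a hypothetical weight."""
    return task_reward + beta * tool_entropy(tool_calls)


# Same number of calls, different diversity (tool names are made up).
repetitive = ["crop", "crop", "crop", "crop"]
diverse = ["crop", "zoom", "depth", "segment"]

repetitive_bonus = tool_entropy(repetitive)
diverse_bonus = tool_entropy(diverse)
```

Both trajectories call a tool four times, so a frequency-based signal would treat them alike, while the entropy term is zero for the repetitive one and maximal for the diverse one.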

AI · Neutral · arXiv – CS AI · May 29 · 6/10

Reasoning and Tool-use Compete in Agentic RL: From Quantifying Interference to Disentangled Tuning

Researchers demonstrate that jointly training language models for both reasoning and tool-use in agentic RL creates measurable performance interference. They introduce DART, a framework that decouples these capabilities through separate low-rank adaptation modules, achieving superior results across thirteen benchmarks and approaching theoretical performance limits.

AI · Neutral · arXiv – CS AI · May 28 · 6/10

When Does Memory Help Multi-Trajectory Inference for Tool-Use LLM Agents?

Researchers demonstrate that memory mechanisms in multi-trajectory LLM agents produce inconsistent results depending on the inference strategy used, revealing that previous evaluations conflated memory abstraction properties with inference method effects. The study systematically evaluates four memory methods across three inference strategies on tool-use benchmarks, showing that reflection, fact extraction, and observation injection each perform optimally under different conditions.

AI · Neutral · arXiv – CS AI · May 28 · 6/10

Do Agents Know What They Can't Do? Evaluating Feasibility Awareness in Tool-Using Agents

Researchers propose FeasiGen, a framework for automatically generating infeasible task benchmarks to evaluate whether AI agents recognize when tasks cannot be completed with available tools. Testing across nine models reveals critical weaknesses, with agents continuing execution on impossible tasks up to 73.9% of the time, though multi-agent architectures show improved performance.

Page 2 of 3