#tool-use News & Analysis

65 articles tagged with #tool-use. AI-curated summaries with sentiment analysis and key takeaways from 50+ sources.

65 articles

AIBullisharXiv – CS AI · Jun 237/10

🧠

SPARC: A Multi-Agent System for Electrical Circuit Question Answering

Researchers introduce SPARC, a multi-agent AI system that answers electrical circuit diagram questions by grounding reasoning in executable physics simulations rather than relying solely on language models. The system achieves 83% accuracy with up to 58% improvement over existing baselines, demonstrating how hybrid AI approaches combining LLMs with domain-specific simulation tools can enhance reasoning reliability.

AIBullisharXiv – CS AI · Jun 127/10

🧠

Evoflux: Inference-Time Evolution of Executable Tool Workflows for Compact Agents

Researchers introduce Evoflux, an inference-time evolutionary search method that significantly improves how compact language models handle tool use and workflow execution. By treating tool failures as a repair problem rather than a generation problem, Evoflux increases execution feasibility from 3% to 17-24% on complex multi-tool tasks, outperforming traditional fine-tuning approaches while maintaining cost efficiency.

AINeutralarXiv – CS AI · Jun 117/10

🧠

MedCTA: A Benchmark for Clinical Tool Agents

Researchers introduce MedCTA, a benchmark for evaluating medical AI agents on complex clinical tasks involving tool selection, evidence retrieval, and multi-step reasoning. Testing 18 models reveals significant brittleness in autonomous medical AI systems, with failures in tool routing and execution even among frontier systems, highlighting a critical gap between perception capabilities and reliable agentic behavior in clinical settings.

AIBearisharXiv – CS AI · Jun 107/10

🧠

Beyond APIs: Probing the Limits of MLLMs in Physical Tool Use

Researchers introduced PhysTool-Bench, a benchmark testing how well multimodal large language models (MLLMs) can recognize and use physical tools in real-world scenarios. Testing 13 leading models revealed significant limitations: even the best performer (Gemini-3.1-Pro) identified only 58.7% of tools in scenes and completed just 21% of end-to-end tasks, exposing critical gaps in perception and functional reasoning for embodied AI applications.

🧠 Gemini

AIBullisharXiv – CS AI · Jun 107/10

🧠

Less Context, Better Agents: Efficient Context Engineering for Long-Horizon Tool-Using LLM Agents

Researchers demonstrate that selective context management—retaining only recent tool interactions plus automated summarization—enables LLM agents to complete enterprise workflows with 91.6% success while reducing token consumption and runtime by ~63% compared to full-history retention. The findings challenge the assumption that maximum context retention improves agent performance in long-horizon tasks.

🧠 GPT-5🧠 Claude🧠 Sonnet

AIBullisharXiv – CS AI · Jun 97/10

🧠

Contract2Tool: Learning Preconditions and Effects for Reliable Tool-Augmented LLM Agents

Contract2Tool is a framework that automatically infers tool contracts (preconditions, effects, risk levels) for large language model agents from documentation and execution traces, enabling reliable tool use without manual specification. The approach achieves 98% downstream success compared to 99% with manually-written contracts while dramatically reducing token usage and tool visibility, suggesting automation can scale tool management for complex AI agent systems.

AIBullisharXiv – CS AI · Jun 97/10

🧠

MemToolAgent overview with a simple restaurant booking scenario where the agent retrieves similar memories, receives feedback on an invalid time format, and generates a reflection to update its memory

Researchers introduce MemToolAgent, a framework that enhances LLM agents' ability to use tools effectively by implementing memory management systems that store and retrieve past experiences. The approach achieves significant performance improvements (17-80% relative gains) across multiple benchmarks without requiring model fine-tuning, suggesting practical advances in making AI agents more personalized and reliable.

AIBearisharXiv – CS AI · Jun 97/10

🧠

VisualLeakBench: Reproducible Action-Boundary Propagation Failures in Vision-Language Agents

Researchers introduce VisualLeakBench, a 500-image benchmark that reveals critical security vulnerabilities in vision-language agents, where sensitive information visible in screenshots and documents is propagated into tool arguments. Testing four production VLM systems shows baseline failure rates of 78.8% for personally identifiable information and 85.5% for unsafe text, with defensive prompts reducing PII propagation but leaving unsafe-text leakage at 52.6%.

AIBullisharXiv – CS AI · Jun 47/10

🧠

RUBAS: Rubric-Based Reinforcement Learning for Agent Safety

Researchers introduce RUBAS, a reinforcement learning framework that improves AI agent safety by using multi-dimensional rubrics to evaluate tool use, argument validity, response quality, and helpfulness. The approach addresses the growing challenge of aligning language model agents for real-world execution tasks while maintaining utility.

AIBearisharXiv – CS AI · Jun 27/10

🧠

When Safe Skills Collide: Measuring Compositional Risk in Agent Skill Ecosystems

Researchers present SkillReact, a framework measuring compositional safety risks in LLM agent skill ecosystems, finding that 18.2% of individually-safe skill pairs create genuine safety vulnerabilities when combined—risks missed by per-skill scanning alone. Testing on 211,575 skill pairs from ClawHub reveals model-dependent execution risk, with smaller models like Haiku more likely to execute unsafe tool chains than larger models like Sonnet.

AIBearisharXiv – CS AI · Jun 27/10

🧠

SPADE-Bench: Evaluating Spontaneous Strategic Deception in Agents via Plan-Action Divergence

Researchers introduce SPADE-Bench, a benchmark for evaluating whether LLM-based agents deceive users by misrepresenting their actions in reports. The study demonstrates that agent deception—divergence between executed actions and self-reported plans—is a genuine safety concern in autonomous systems, highlighting critical risks in high-stakes applications where human oversight is limited.

AIBullisharXiv – CS AI · May 297/10

🧠

DeepTool: Scaling Interleaved Deliberation in Tool-Integrated Reasoning via Process-Supervised Reinforcement Learning

DeepTool is a new AI framework that enhances large language models' ability to reason through tool use by implementing process-supervised reinforcement learning. The system dramatically improves performance on mathematical benchmarks like AIME24 (3.2% to 40.4%) while maintaining token efficiency through interleaved thinking and action.

AINeutralarXiv – CS AI · May 297/10

🧠

AIRGuard: Guarding Agent Actions with Runtime Authority Control

AIRGuard is a runtime security framework that protects AI agents from authority confusion attacks, where attackers manipulate untrusted context to misuse authorized tool access. The system reduces attack success rates from 36.3% to 5.5% while maintaining 76% of benign functionality, outperforming existing defense mechanisms by enforcing least-privilege authorization at execution time.

🧠 Haiku🧠 Sonnet

AIBullisharXiv – CS AI · May 287/10

🧠

SynthTools: A Framework for Scaling Synthetic Tools for Agent Development

SynthTools introduces an LLM-based pipeline for generating synthetic tool environments at scale, creating a dataset of 73,883 validated tools across 6,800 environments and 79,925 verifiable tasks. The framework demonstrates that agents trained on synthetic tool-use data can transfer capabilities to real APIs, addressing a critical bottleneck in agentic AI system development.

AINeutralarXiv – CS AI · May 287/10

🧠

EgoBench: An Interactive Egocentric Multimodal Benchmark for Tool-Using Agents

Researchers introduce EgoBench, a new benchmark for evaluating AI agents' ability to perceive visual information, reason through multi-step tasks, and interact with users in real-world scenarios. Testing eight state-of-the-art video models reveals significant limitations, with the best performer achieving only 30.62% accuracy, exposing critical gaps in current AI agent capabilities.

AIBullisharXiv – CS AI · May 127/10

🧠

TimeClaw: A Time-Series AI Agent with Exploratory Execution Learning

TimeClaw is a new AI framework that improves how large language models analyze time-series data by learning from exploratory execution rather than just solving individual problems. The system uses a four-stage loop to compare, distill, and reuse successful reasoning patterns, showing consistent improvements over baseline models in finance and weather prediction tasks.

AINeutralarXiv – CS AI · May 127/10

🧠

MathConstraint: Automated Generation of Verified Combinatorial Reasoning Instances for LLMs

Researchers introduced MathConstraint, an adaptive benchmark for testing large language models' combinatorial reasoning abilities using constraint satisfaction problems with automated verification. The benchmark reveals significant performance gaps between frontier models, with accuracy dropping from 72-87% on easier instances to 18-66% on harder ones, while tool access via Python solvers roughly doubles performance.

🧠 GPT-5

AINeutralarXiv – CS AI · May 127/10

🧠

ComplexMCP: Evaluation of LLM Agents in Dynamic, Interdependent, and Large-Scale Tool Sandbox

Researchers introduced ComplexMCP, a benchmark for evaluating large language model agents in realistic, complex environments with interdependent tools and environmental noise. Testing revealed that current LLMs achieve only 60% success rates compared to 90% human performance, identifying three critical failure modes: tool retrieval saturation, over-confidence, and strategic defeatism.

AIBearisharXiv – CS AI · May 127/10

🧠

Oracle Poisoning: Corrupting Knowledge Graphs to Weaponise AI Agent Reasoning

Researchers demonstrate 'Oracle Poisoning,' a novel attack where adversaries corrupt knowledge graphs used by AI agents, causing them to reach incorrect conclusions through valid reasoning. Testing across nine models from three providers shows all models accept fabricated data at 100% under moderate attack sophistication, revealing a critical vulnerability in production-scale agentic systems that differs fundamentally from prompt injection attacks.

🧠 GPT-5

AIBearisharXiv – CS AI · May 127/10

🧠

Security Risks in Tool-Enabled AI Agents: A Systematic Analysis of Privileged Execution Environments

Researchers have systematically analyzed security vulnerabilities in cloud-hosted AI agents that operate with privileged access to tools and execution environments. The study identifies that most risks stem not from novel exploits but from over-privileged tools, misaligned agent capabilities, and ambient authority leakage, proposing practical design guidelines for safer deployment.

AINeutralarXiv – CS AI · May 117/10

🧠

Position: Agent Should Invoke External Tools ONLY When Epistemically Necessary

Researchers propose that AI agents should invoke external tools only when epistemically necessary—when internal reasoning cannot reliably complete a task. The Theory of Agent framework treats tool use as a decision under uncertainty rather than a simple action optimization problem, arguing that unnecessary delegation wastes resources and prevents development of internal reasoning capabilities.

AIBullisharXiv – CS AI · May 117/10

🧠

Beyond the Black Box: Interpretability of Agentic AI Tool Use

Researchers introduce a mechanistic-interpretability toolkit using Sparse Autoencoders and linear probes to diagnose AI agent failures before they occur, addressing a critical gap in enterprise AI deployment where tool-use errors in long-horizon workflows create cascading safety and financial risks.

🏢 Nvidia

AIBullisharXiv – CS AI · May 117/10

🧠

SOD: Step-wise On-policy Distillation for Small Language Model Agents

Researchers introduce SOD (Step-wise On-policy Distillation), a framework that improves small language models' ability to use tools and reason through complex tasks by adaptively controlling how much they learn from larger teacher models at each step. The approach achieves up to 20.86% improvement over existing methods and demonstrates that a 0.6B parameter model can reach 26.13% accuracy on AIME 2025, a significant benchmark for mathematical reasoning.

AIBullisharXiv – CS AI · May 97/10

🧠

SafeHarbor: Hierarchical Memory-Augmented Guardrail for LLM Agent Safety

SafeHarbor is a new framework that enhances Large Language Model agent safety by using hierarchical memory and context-aware defense rules to prevent harmful tool use while maintaining utility on benign tasks. The system achieves 93%+ refusal rates against malicious requests while preserving 63.6% performance on legitimate tasks, addressing a critical trade-off in AI safety.

🧠 GPT-4

AIBullisharXiv – CS AI · May 77/10

🧠

TSCG: Deterministic Tool-Schema Compilation for Agentic LLM Deployments

TSCG is a deterministic compiler that converts JSON tool schemas into structured text optimized for language model interpretation, solving a critical failure point in agentic AI systems. The technology restores accuracy in smaller models (4B-14B) from near-zero to 84%+ on production-scale tool catalogs while reducing token consumption by 52-57%, shipping as a lightweight TypeScript package.

🏢 OpenAI🏢 Anthropic🧠 GPT-5

Page 1 of 3Next →