y0news

#tool-use News & Analysis

18 articles tagged with #tool-use. AI-curated summaries with sentiment analysis and key takeaways from 50+ sources.

AI · Neutral · arXiv – CS AI · 3d ago · 7/10
🧠

The Amazing Agent Race: Strong Tool Users, Weak Navigators

Researchers introduce The Amazing Agent Race (AAR), a new benchmark revealing that LLM agents excel at tool-use but struggle with navigation tasks. Testing three agent frameworks on 1,400 complex, graph-structured puzzles shows the best achieve only 37.2% accuracy, with navigation errors (27-52% of failures) far outweighing tool-use failures (below 17%), exposing a critical blind spot in existing linear benchmarks.

🧠 Claude
AI · Bullish · arXiv – CS AI · 3d ago · 7/10
🧠

UniToolCall: Unifying Tool-Use Representation, Data, and Evaluation for LLM Agents

UniToolCall introduces a standardized framework unifying tool-use representation, training data, and evaluation for LLM agents. The framework combines 22k+ tools and 390k+ training instances with a unified evaluation methodology, enabling fine-tuned models like Qwen3-8B to achieve 93% precision, surpassing GPT, Gemini, and Claude on specific benchmarks.
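As a rough illustration only (the summary does not give UniToolCall's actual schema, so the field names below are assumptions), a "unified" tool-call representation typically normalizes every invocation into one canonical record:

```python
import json

def make_tool_call(name: str, arguments: dict, result=None) -> dict:
    """Normalize a tool invocation into one canonical dict,
    regardless of which agent framework produced it."""
    return {
        "tool": name,
        "arguments": arguments,
        "result": result,  # filled in after execution
    }

# Hypothetical example call; "search_web" is an invented tool name.
call = make_tool_call("search_web", {"query": "LLM agents"})
print(json.dumps(call))
```

The value of such a format is that training data, fine-tuning targets, and evaluation harnesses can all consume the same records.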

🧠 Claude · 🧠 Gemini
AI · Neutral · arXiv – CS AI · Apr 10 · 7/10
🧠

Benchmarking LLM Tool-Use in the Wild

Researchers introduce WildToolBench, a new benchmark for evaluating large language models' ability to use tools in real-world scenarios. Testing 57 LLMs reveals that none exceed 15% accuracy, exposing significant gaps in current models' agentic capabilities when facing messy, multi-turn user interactions rather than simplified synthetic tasks.

AI · Bullish · arXiv – CS AI · Apr 6 · 7/10
🧠

Training Multi-Image Vision Agents via End2End Reinforcement Learning

Researchers introduce IMAgent, an open-source visual AI agent trained with reinforcement learning to handle multi-image reasoning tasks. The system addresses limitations of current VLM-based agents that only process single images, using specialized tools for visual reflection and verification to maintain attention on image content throughout inference.

🏢 OpenAI · 🧠 o1 · 🧠 o3
AI · Bullish · arXiv – CS AI · Mar 17 · 7/10
🧠

AutoTool: Automatic Scaling of Tool-Use Capabilities in RL via Decoupled Entropy Constraints

Researchers introduce AutoTool, a new reinforcement learning approach that enables AI agents to automatically scale their reasoning capabilities for tool use. The method uses entropy-based optimization and supervised fine-tuning to help models efficiently determine appropriate thinking lengths for simple versus complex problems, achieving 9.8% accuracy improvements while reducing computational overhead by 81%.
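The summary does not detail AutoTool's decoupled entropy constraints, so the sketch below only illustrates the general idea it alludes to: using the entropy of the model's next-token distribution as an uncertainty signal for deciding whether a longer reasoning budget is worthwhile.

```python
import math

def entropy(probs) -> float:
    """Shannon entropy (in nats) of a discrete distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def should_keep_thinking(probs, threshold: float = 0.5) -> bool:
    # High entropy means the model is still uncertain, so a longer
    # "thinking" budget may pay off; low entropy means stop early.
    return entropy(probs) > threshold

print(should_keep_thinking([0.9, 0.05, 0.05]))   # peaked distribution: stop
print(should_keep_thinking([0.25, 0.25, 0.25, 0.25]))  # uncertain: keep going
```

The threshold value here is arbitrary; the paper's contribution is learning such trade-offs within RL rather than hand-tuning them.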

AI · Neutral · arXiv – CS AI · Mar 17 · 7/10
🧠

CCTU: A Benchmark for Tool Use under Complex Constraints

Researchers introduce CCTU, a new benchmark for evaluating large language models' ability to use tools under complex constraints. The study reveals that even state-of-the-art LLMs achieve less than 20% task completion rates when strict constraint adherence is required, with models violating constraints in over 50% of cases.
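A minimal sketch of what constraint-checked tool use looks like, in the spirit of the setting CCTU evaluates (the constraint format below is invented for illustration, not taken from the benchmark):

```python
def violates(call: dict, constraints: list) -> list:
    """Return descriptions of the constraints a proposed tool call breaks."""
    return [c["desc"] for c in constraints if not c["check"](call)]

# Hypothetical constraints for a travel-booking agent.
constraints = [
    {"desc": "budget <= 100",
     "check": lambda c: c["arguments"].get("amount", 0) <= 100},
    {"desc": "tool is allowed",
     "check": lambda c: c["tool"] in {"search", "book"}},
]

call = {"tool": "book", "arguments": {"amount": 250}}
print(violates(call, constraints))  # the budget constraint is broken
```

CCTU's finding that models violate constraints in over 50% of cases suggests agents need exactly this kind of explicit pre-execution check rather than relying on the model to self-police.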

AI · Bullish · arXiv – CS AI · Mar 11 · 7/10
🧠

AlphaApollo: A System for Deep Agentic Reasoning

AlphaApollo is a new AI reasoning system that addresses limitations in foundation models through multi-turn agentic reasoning, learning, and evolution components. The system demonstrates significant performance improvements across math reasoning benchmarks, with success rates exceeding 85% for tool calls and substantial gains from reinforcement learning across different model scales.

AI · Bullish · arXiv – CS AI · Feb 27 · 7/10
🧠

OmniGAIA: Towards Native Omni-Modal AI Agents

Researchers introduce OmniGAIA, a comprehensive benchmark for evaluating omni-modal AI agents that can process video, audio, and image data simultaneously with complex reasoning capabilities. They also propose OmniAtlas, a foundation agent that enhances existing open-source models' ability to use tools across multiple modalities, marking progress toward more capable AI assistants.

AI · Neutral · arXiv – CS AI · Feb 27 · 7/10
🧠

The Tool Decathlon: Benchmarking Language Agents for Diverse, Realistic, and Long-Horizon Task Execution

Researchers introduce Tool Decathlon (Toolathlon), a comprehensive benchmark for evaluating AI language agents across 32 software applications and 604 tools in realistic, multi-step scenarios. The benchmark reveals significant limitations in current AI models, with the best performer (Claude-4.5-Sonnet) achieving only 38.6% success rate on complex, real-world tasks.

AI · Neutral · arXiv – CS AI · 3d ago · 6/10
🧠

The Rise and Fall of $G$ in AGI

Researchers apply psychometric analysis to large language model benchmarks, discovering that AI's general intelligence factor (G-factor) peaked around 2023-2024 before fragmenting as models specialized in reasoning tasks. The finding suggests AI development is shifting from unified capability improvement toward specialized tool-using systems, challenging assumptions about monolithic AGI progress.

AI · Neutral · arXiv – CS AI · Mar 17 · 6/10
🧠

AgentProcessBench: Diagnosing Step-Level Process Quality in Tool-Using Agents

Researchers introduce AgentProcessBench, the first benchmark for evaluating step-level effectiveness in AI tool-using agents, comprising 1,000 trajectories and 8,509 human-labeled annotations. The benchmark reveals that current AI models struggle with distinguishing neutral and erroneous actions in tool execution, and that process-level signals can significantly enhance test-time performance.

AI · Neutral · arXiv – CS AI · Mar 17 · 6/10
🧠

VTC-Bench: Evaluating Agentic Multimodal Models via Compositional Visual Tool Chaining

Researchers introduce VTC-Bench, a comprehensive benchmark for evaluating multimodal AI models' ability to use visual tools for complex tasks. The benchmark reveals significant limitations in current models, with leading model Gemini-3.0-Pro achieving only 51% accuracy on multi-tool visual reasoning tasks.

🧠 Gemini
AI · Bullish · arXiv – CS AI · Mar 3 · 6/10
🧠

CoVe: Training Interactive Tool-Use Agents via Constraint-Guided Verification

Researchers introduce CoVe, a framework for training interactive tool-use AI agents that uses constraint-guided verification to generate high-quality training data. The compact CoVe-4B model achieves competitive performance with models 17 times larger on benchmark tests, with the team open-sourcing code, models, and 12K training trajectories.

AI · Bullish · OpenAI News · Aug 5 · 6/10
🧠

Introducing gpt-oss

OpenAI has released gpt-oss-120b and gpt-oss-20b, two open-weight language models under the Apache 2.0 license that deliver strong performance at low cost. The models excel at reasoning tasks and tool use while being optimized for efficient deployment on consumer hardware.

AI · Bullish · Lil'Log (Lilian Weng) · Jun 23 · 6/10
🧠

LLM Powered Autonomous Agents

The article explores LLM-powered autonomous agents that use large language models as core controllers, going beyond text generation to serve as general problem solvers. Key systems like AutoGPT, GPT-Engineer, and BabyAGI demonstrate the potential of agents with planning, memory, and tool-use capabilities.
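The plan-act-remember pattern the article describes can be sketched in a toy loop. The "planner" below is a keyword stub standing in for an actual LLM call, and the tool registry is invented for illustration; systems like AutoGPT implement the same skeleton with real model calls.

```python
def llm_plan(task: str, memory: list) -> str:
    # Stub planner: a real agent would prompt an LLM with the task
    # and its memory to choose the next tool.
    return "calculator" if "sum" in task else "search"

TOOLS = {
    "calculator": lambda task: sum(int(x) for x in task.split() if x.isdigit()),
    "search": lambda task: f"results for: {task}",
}

def run_agent(task: str):
    memory = []                    # episodic memory of past steps
    tool = llm_plan(task, memory)  # planning: decide which tool to use
    result = TOOLS[tool](task)     # tool use: execute the chosen tool
    memory.append((tool, result))  # remember the outcome for later steps
    return result, memory

print(run_agent("sum 2 and 3"))
```

Real agents iterate this loop, feeding tool results back into the planner until the task is judged complete.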

AI · Bullish · OpenAI News · Sep 17 · 6/10
🧠

Emergent tool use from multi-agent interaction

Researchers observed AI agents developing increasingly complex strategies through multi-agent interaction in a hide-and-seek game environment. The agents independently discovered six distinct strategies and counterstrategies, some of which the researchers had not known were possible in the environment, suggesting emergent complexity from self-supervised learning.

AI · Neutral · Hugging Face Blog · Aug 12 · 1/10
🧠

Tool Use, Unified

The title 'Tool Use, Unified' suggests a development in unified AI tooling, but because no article content was provided, specifics about the announcement, implementation, or market impact cannot be analyzed.