18 articles tagged with #tool-use. AI-curated summaries with sentiment analysis and key takeaways from 50+ sources.
AI · Neutral · arXiv (CS AI) · 3d ago · 7/10
🧠 Researchers introduce The Amazing Agent Race (AAR), a new benchmark revealing that LLM agents excel at tool use but struggle with navigation tasks. Testing three agent frameworks on 1,400 complex, graph-structured puzzles shows the best framework achieves only 37.2% accuracy, with navigation errors (27-52% of failures) far outweighing tool-use failures (below 17%), exposing a critical blind spot in existing linear benchmarks.
🧠 Claude
AI · Bullish · arXiv (CS AI) · 3d ago · 7/10
🧠 UniToolCall introduces a standardized framework unifying tool-use representation, training data, and evaluation for LLM agents. The framework combines 22k+ tools and 390k+ training instances with a unified evaluation methodology, enabling fine-tuned models like Qwen3-8B to achieve 93% precision, surpassing GPT, Gemini, and Claude on specific benchmarks.
🧠 Claude 🧠 Gemini
AI · Neutral · arXiv (CS AI) · Apr 10 · 7/10
🧠 Researchers introduce WildToolBench, a new benchmark for evaluating large language models' ability to use tools in real-world scenarios. Testing 57 LLMs reveals that none exceed 15% accuracy, exposing significant gaps in current models' agentic capabilities when facing messy, multi-turn user interactions rather than simplified synthetic tasks.
AI · Bullish · arXiv (CS AI) · Apr 6 · 7/10
🧠 Researchers introduce IMAgent, an open-source visual AI agent trained with reinforcement learning to handle multi-image reasoning tasks. The system addresses limitations of current VLM-based agents that only process single images, using specialized tools for visual reflection and verification to maintain attention on image content throughout inference.
🏢 OpenAI 🧠 o1 🧠 o3
AI · Bullish · arXiv (CS AI) · Mar 17 · 7/10
🧠 Researchers introduce AutoTool, a new reinforcement learning approach that enables AI agents to automatically scale their reasoning capabilities for tool use. The method uses entropy-based optimization and supervised fine-tuning to help models efficiently determine appropriate thinking lengths for simple versus complex problems, achieving 9.8% accuracy improvements while reducing computational overhead by 81%.
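The intuition behind the entropy-based length control described above can be sketched as: use the entropy of the model's next-token distribution as a difficulty signal and budget more "thinking" tokens when entropy is high. The thresholds and budget mapping below are illustrative assumptions, not taken from the paper:

```python
# Hedged sketch of entropy-gated reasoning budgets; constants are
# hypothetical, not AutoTool's actual mechanism.
import math

def entropy(probs: list[float]) -> float:
    """Shannon entropy (in nats) of a next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def thinking_budget(probs: list[float], short: int = 64, long: int = 1024) -> int:
    """Low entropy -> model is confident -> short reasoning budget;
    high entropy -> uncertain/complex -> long budget."""
    max_h = math.log(len(probs))  # entropy of the uniform distribution
    return short if entropy(probs) < 0.5 * max_h else long
```

A peaked distribution like `[0.97, 0.01, 0.01, 0.01]` falls well below half the uniform entropy and gets the short budget; a flat one gets the long budget.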
AI · Neutral · arXiv (CS AI) · Mar 17 · 7/10
🧠 Researchers introduce CCTU, a new benchmark for evaluating large language models' ability to use tools under complex constraints. The study reveals that even state-of-the-art LLMs achieve less than 20% task completion rates when strict constraint adherence is required, with models violating constraints in over 50% of cases.
AI · Bullish · arXiv (CS AI) · Mar 11 · 7/10
🧠 AlphaApollo is a new AI reasoning system that addresses limitations in foundation models through multi-turn agentic reasoning, learning, and evolution components. The system demonstrates significant performance improvements across math reasoning benchmarks, with success rates exceeding 85% for tool calls and substantial gains from reinforcement learning across different model scales.
AI · Bullish · arXiv (CS AI) · Feb 27 · 7/10
🧠 Researchers introduce OmniGAIA, a comprehensive benchmark for evaluating omni-modal AI agents that can process video, audio, and image data simultaneously with complex reasoning capabilities. They also propose OmniAtlas, a foundation agent that enhances existing open-source models' ability to use tools across multiple modalities, marking progress toward more capable AI assistants.
AI · Neutral · arXiv (CS AI) · Feb 27 · 7/10
🧠 Researchers introduce Tool Decathlon (Toolathlon), a comprehensive benchmark for evaluating AI language agents across 32 software applications and 604 tools in realistic, multi-step scenarios. The benchmark reveals significant limitations in current AI models, with the best performer (Claude-4.5-Sonnet) achieving only a 38.6% success rate on complex, real-world tasks.
AI · Neutral · arXiv (CS AI) · 3d ago · 6/10
🧠 Researchers apply psychometric analysis to large language model benchmarks, discovering that AI's general intelligence factor (G-factor) peaked around 2023-2024 before fragmenting as models specialized in reasoning tasks. The finding suggests AI development is shifting from unified capability improvement toward specialized tool-using systems, challenging assumptions about monolithic AGI progress.
AI · Neutral · arXiv (CS AI) · Mar 17 · 6/10
🧠 Researchers introduce AgentProcessBench, the first benchmark for evaluating step-level effectiveness in AI tool-using agents, comprising 1,000 trajectories and 8,509 human-labeled annotations. The benchmark reveals that current AI models struggle to distinguish neutral from erroneous actions in tool execution, and that process-level signals can significantly enhance test-time performance.
AI · Neutral · arXiv (CS AI) · Mar 17 · 6/10
🧠 Researchers introduce VTC-Bench, a comprehensive benchmark for evaluating multimodal AI models' ability to use visual tools for complex tasks. The benchmark reveals significant limitations in current models, with leading model Gemini-3.0-Pro achieving only 51% accuracy on multi-tool visual reasoning tasks.
🧠 Gemini
AI · Neutral · arXiv (CS AI) · Mar 3 · 6/10
🧠 Researchers released ASTRA-bench, a new benchmark for evaluating AI agents' ability to handle complex, multi-step reasoning with personal context and tool usage. Testing revealed that current state-of-the-art models like Claude-4.5-Opus and DeepSeek-V3.2 show significant performance degradation in high-complexity scenarios.
AI · Bullish · arXiv (CS AI) · Mar 3 · 6/10
🧠 Researchers introduce CoVe, a framework for training interactive tool-use AI agents that uses constraint-guided verification to generate high-quality training data. The compact CoVe-4B model achieves competitive performance with models 17 times larger on benchmark tests, with the team open-sourcing code, models, and 12K training trajectories.
AI · Bullish · OpenAI News · Aug 5 · 6/10
🧠 OpenAI has released gpt-oss-120b and gpt-oss-20b, two open-weight language models under the Apache 2.0 license that deliver strong performance at low cost. The models excel at reasoning tasks and tool use while being optimized for efficient deployment on consumer hardware.
AI · Bullish · Lil'Log (Lilian Weng) · Jun 23 · 6/10
🧠 The article explores LLM-powered autonomous agents that use large language models as core controllers, going beyond text generation to serve as general problem solvers. Key systems like AutoGPT, GPT-Engineer, and BabyAGI demonstrate the potential of agents with planning, memory, and tool-use capabilities.
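The controller architecture this article describes (an LLM core plus planning, memory, and tool use) boils down to a short loop: consult the model, dispatch any requested tool, append the observation to memory, repeat. A minimal sketch in which the model call is a stub and the tool names are illustrative assumptions:

```python
# Toy agent loop; fake_llm stands in for a real model call, and the
# tools are hypothetical examples, not any system's actual API.
from typing import Callable

TOOLS: dict[str, Callable[[str], str]] = {
    "search": lambda q: f"results for {q!r}",
    "calculator": lambda expr: str(eval(expr)),  # toy only; never eval untrusted input
}

def fake_llm(memory: list[str]) -> str:
    """Stub controller: emits 'tool:arg' actions, then a 'final:' answer."""
    if not any(m.startswith("observation:") for m in memory):
        return "calculator:2+3"
    return "final:the answer is 5"

def run_agent(task: str, max_steps: int = 5) -> str:
    memory = [f"task: {task}"]            # short-term memory = running transcript
    for _ in range(max_steps):
        action = fake_llm(memory)         # planning step delegated to the LLM
        if action.startswith("final:"):
            return action[len("final:"):].strip()
        name, _, arg = action.partition(":")
        observation = TOOLS[name](arg)    # tool use
        memory.append(f"observation: {observation}")
    return "gave up"
```

Real systems differ mainly in how they structure the transcript (plans, reflections, retrieved long-term memory) and how tool calls are parsed, but the loop shape is the same.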
AI · Bullish · OpenAI News · Sep 17 · 6/10
🧠 Researchers observed AI agents developing increasingly complex strategies through multi-agent interaction in a hide-and-seek game environment. The agents independently discovered six distinct strategies and counterstrategies, some of which the environment's designers had not anticipated were possible, suggesting emergent complexity from multi-agent reinforcement learning.
AI · Neutral · Hugging Face Blog · Aug 12 · 1/10
🧠 The article title "Tool Use, Unified" appears to reference a development in AI tooling or unified systems. However, since no article content was provided, specific details about the announcement, implementation, or market impact cannot be analyzed.