AINeutralarXiv – CS AI · May 125/10
🧠Researchers demonstrate that preserving API request/response trajectories during continual learning significantly improves tool-use performance in language models. Fine-tuning Llama 3.1 8B on sequential API domains shows trajectory supervision achieves 56.9% accuracy versus 39.2% without intermediate context, though at a 25.1% token cost increase.
🧠 Llama
AINeutralarXiv – CS AI · May 126/10
🧠Researchers introduce PruneTIR, an inference-time optimization framework that improves tool-integrated reasoning in large language models by pruning failed trajectories, resampling tool calls, and suspending tool usage when errors persist. The approach enhances LLM performance without requiring additional training, demonstrating significant improvements in accuracy and efficiency.
AINeutralarXiv – CS AI · May 116/10
🧠Researchers introduce ChemCost, a benchmark for evaluating LLM agents on chemical cost estimation from reaction descriptions. The study reveals that even frontier LLMs achieve only 50.6% accuracy on clean inputs and degrade significantly with realistic noise, exposing brittleness in parsing, evidence integration, and tool use despite access to domain-specific tools.
AINeutralarXiv – CS AI · May 116/10
🧠Researchers introduced AgentEscapeBench, a benchmark that evaluates how well LLM-based agents can reason through complex, multi-step tasks requiring external tool use and long-range dependency tracking. Testing 16 LLM agents against 270 escape-room-style problems revealed significant performance degradation as task complexity increased, with the best models dropping from 90% success to 60% as dependency depth tripled, highlighting a critical limitation in current AI agent capabilities.
AINeutralarXiv – CS AI · May 46/10
🧠Researchers demonstrate that tool-augmented reasoning in LLM agents doesn't always outperform chain-of-thought reasoning, especially when semantic noise is present. A proposed "tool-use tax" reveals that protocol overhead and formatting costs often negate performance gains from tool execution, with a lightweight gating solution offering only partial mitigation.
AINeutralarXiv – CS AI · Apr 146/10
🧠Researchers apply psychometric analysis to large language model benchmarks, discovering that AI's general intelligence factor (G-factor) peaked around 2023-2024 before fragmenting as models specialized in reasoning tasks. The finding suggests AI development is shifting from unified capability improvement toward specialized tool-using systems, challenging assumptions about monolithic AGI progress.
AINeutralarXiv – CS AI · Mar 176/10
🧠Researchers introduce AgentProcessBench, the first benchmark for evaluating step-level effectiveness in AI tool-using agents, comprising 1,000 trajectories and 8,509 human-labeled annotations. The benchmark reveals that current AI models struggle with distinguishing neutral and erroneous actions in tool execution, and that process-level signals can significantly enhance test-time performance.
AINeutralarXiv – CS AI · Mar 176/10
🧠Researchers introduce VTC-Bench, a comprehensive benchmark for evaluating multimodal AI models' ability to use visual tools for complex tasks. The benchmark reveals significant limitations in current models, with leading model Gemini-3.0-Pro achieving only 51% accuracy on multi-tool visual reasoning tasks.
🧠 Gemini
AINeutralarXiv – CS AI · Mar 36/108
🧠Researchers released ASTRA-bench, a new benchmark for evaluating AI agents' ability to handle complex, multi-step reasoning with personal context and tool usage. Testing revealed that current state-of-the-art models like Claude-4.5-Opus and DeepSeek-V3.2 show significant performance degradation in high-complexity scenarios.
AIBullisharXiv – CS AI · Mar 36/107
🧠Researchers introduce CoVe, a framework for training interactive tool-use AI agents that uses constraint-guided verification to generate high-quality training data. The compact CoVe-4B model achieves competitive performance with models 17 times larger on benchmark tests, with the team open-sourcing code, models, and 12K training trajectories.
AIBullishOpenAI News · Aug 56/106
🧠A new company has released gpt-oss-120b and gpt-oss-20b, two open-weight language models under Apache 2.0 license that deliver strong performance at low cost. The models excel at reasoning tasks and tool use while being optimized for efficient deployment on consumer hardware.
AIBullishLil'Log (Lilian Weng) · Jun 236/10
🧠The article explores LLM-powered autonomous agents that use large language models as core controllers, going beyond text generation to serve as general problem solvers. Key systems like AutoGPT, GPT-Engineer, and BabyAGI demonstrate the potential of agents with planning, memory, and tool-use capabilities.
AIBullishOpenAI News · Sep 176/107
🧠Researchers observed AI agents developing increasingly complex strategies through multi-agent interaction in a hide-and-seek game environment. The agents independently discovered six distinct strategies and counterstrategies, some of which were previously unknown to be possible in the environment, suggesting emergent complexity from self-supervised learning.
AINeutralHugging Face Blog · Aug 121/106
🧠The article title 'Tool Use, Unified' appears to reference a development in AI tooling or unified systems. However, without article content provided, specific details about the announcement, implementation, or market impact cannot be analyzed.