AI · Bullish · arXiv – CS AI · 3d ago · 7/10
🧠DeepTool is a new AI framework that enhances large language models' ability to reason through tool use by implementing process-supervised reinforcement learning. The system raises performance on mathematical benchmarks such as AIME24 (from 3.2% to 40.4%) while maintaining token efficiency through interleaved thinking and action.
AI · Neutral · arXiv – CS AI · 3d ago · 7/10
🧠AIRGuard is a runtime security framework that protects AI agents from authority confusion attacks, where attackers manipulate untrusted context to misuse authorized tool access. The system reduces attack success rates from 36.3% to 5.5% while maintaining 76% of benign functionality, outperforming existing defense mechanisms by enforcing least-privilege authorization at execution time.
🧠 Haiku · 🧠 Sonnet
AI · Neutral · arXiv – CS AI · 4d ago · 7/10
🧠Researchers introduce EgoBench, a new benchmark for evaluating AI agents' ability to perceive visual information, reason through multi-step tasks, and interact with users in real-world scenarios. Testing eight state-of-the-art video models reveals significant limitations, with the best performer achieving only 30.62% accuracy, exposing critical gaps in current AI agent capabilities.
AI · Bullish · arXiv – CS AI · 4d ago · 7/10
🧠SynthTools introduces an LLM-based pipeline for generating synthetic tool environments at scale, creating a dataset of 73,883 validated tools across 6,800 environments and 79,925 verifiable tasks. The framework demonstrates that agents trained on synthetic tool-use data can transfer capabilities to real APIs, addressing a critical bottleneck in agentic AI system development.
AI · Bullish · arXiv – CS AI · May 12 · 7/10
🧠TimeClaw is a new AI framework that improves how large language models analyze time-series data by learning from exploratory execution rather than just solving individual problems. The system uses a four-stage loop to compare, distill, and reuse successful reasoning patterns, showing consistent improvements over baseline models in finance and weather prediction tasks.
AI · Neutral · arXiv – CS AI · May 12 · 7/10
🧠Researchers introduced MathConstraint, an adaptive benchmark for testing large language models' combinatorial reasoning abilities using constraint satisfaction problems with automated verification. The benchmark reveals significant performance gaps between frontier models, with accuracy dropping from 72-87% on easier instances to 18-66% on harder ones, while tool access via Python solvers roughly doubles performance.
🧠 GPT-5
AI · Bearish · arXiv – CS AI · May 12 · 7/10
🧠Researchers have systematically analyzed security vulnerabilities in cloud-hosted AI agents that operate with privileged access to tools and execution environments. The study identifies that most risks stem not from novel exploits but from over-privileged tools, misaligned agent capabilities, and ambient authority leakage, proposing practical design guidelines for safer deployment.
AI · Neutral · arXiv – CS AI · May 12 · 7/10
🧠Researchers introduced ComplexMCP, a benchmark for evaluating large language model agents in realistic, complex environments with interdependent tools and environmental noise. Testing revealed that current LLMs achieve only a 60% success rate, compared with 90% for humans, and identified three critical failure modes: tool retrieval saturation, over-confidence, and strategic defeatism.
AI · Bearish · arXiv – CS AI · May 12 · 7/10
🧠Researchers demonstrate 'Oracle Poisoning,' a novel attack where adversaries corrupt knowledge graphs used by AI agents, causing them to reach incorrect conclusions through valid reasoning. Testing across nine models from three providers shows that every model accepts fabricated data at a 100% rate under moderately sophisticated attacks, revealing a critical vulnerability in production-scale agentic systems that differs fundamentally from prompt injection attacks.
🧠 GPT-5
AI · Bullish · arXiv – CS AI · May 11 · 7/10
🧠Researchers introduce a mechanistic-interpretability toolkit using Sparse Autoencoders and linear probes to diagnose AI agent failures before they occur, addressing a critical gap in enterprise AI deployment where tool-use errors in long-horizon workflows create cascading safety and financial risks.
🏢 Nvidia
AI · Bullish · arXiv – CS AI · May 11 · 7/10
🧠Researchers introduce SOD (Step-wise On-policy Distillation), a framework that improves small language models' ability to use tools and reason through complex tasks by adaptively controlling how much they learn from larger teacher models at each step. The approach achieves up to 20.86% improvement over existing methods and demonstrates that a 0.6B parameter model can reach 26.13% accuracy on AIME 2025, a significant benchmark for mathematical reasoning.
AI · Neutral · arXiv – CS AI · May 11 · 7/10
🧠Researchers propose that AI agents should invoke external tools only when epistemically necessary—when internal reasoning cannot reliably complete a task. The Theory of Agent framework treats tool use as a decision under uncertainty rather than a simple action optimization problem, arguing that unnecessary delegation wastes resources and prevents development of internal reasoning capabilities.
AI · Bullish · arXiv – CS AI · May 9 · 7/10
🧠SafeHarbor is a new framework that enhances Large Language Model agent safety by using hierarchical memory and context-aware defense rules to prevent harmful tool use while maintaining utility on benign tasks. The system achieves 93%+ refusal rates against malicious requests while preserving 63.6% performance on legitimate tasks, addressing a critical trade-off in AI safety.
🧠 GPT-4
AI · Bullish · arXiv – CS AI · May 7 · 7/10
🧠TSCG is a deterministic compiler that converts JSON tool schemas into structured text optimized for language model interpretation, solving a critical failure point in agentic AI systems. The technology restores accuracy in smaller models (4B-14B) from near-zero to 84%+ on production-scale tool catalogs while reducing token consumption by 52-57%, shipping as a lightweight TypeScript package.
🏢 OpenAI · 🏢 Anthropic · 🧠 GPT-5
AI · Bullish · arXiv – CS AI · Apr 20 · 7/10
🧠Researchers introduce AgentV-RL, an agentic verifier framework that enhances reward modeling for large language models by combining bidirectional reasoning agents with tool-use capabilities. The system addresses critical limitations in LLM verification by enabling forward and backward tracing of solutions, achieving 25.2% performance gains over existing methods and positioning agentic reward modeling as a promising new paradigm.
AI · Neutral · arXiv – CS AI · Apr 14 · 7/10
🧠Researchers introduce The Amazing Agent Race (AAR), a new benchmark revealing that LLM agents excel at tool use but struggle with navigation tasks. Testing three agent frameworks on 1,400 complex, graph-structured puzzles shows that the best achieves only 37.2% accuracy, with navigation errors (27-52% of failures) far outweighing tool-use failures (below 17%), exposing a critical blind spot in existing linear benchmarks.
🧠 Claude
AI · Bullish · arXiv – CS AI · Apr 14 · 7/10
🧠UniToolCall introduces a standardized framework unifying tool-use representation, training data, and evaluation for LLM agents. The framework combines 22k+ tools and 390k+ training instances with a unified evaluation methodology, enabling fine-tuned models like Qwen3-8B to achieve 93% precision—surpassing GPT, Gemini, and Claude in specific benchmarks.
🧠 Claude · 🧠 Gemini
AI · Neutral · arXiv – CS AI · Apr 10 · 7/10
🧠Researchers introduce WildToolBench, a new benchmark for evaluating large language models' ability to use tools in real-world scenarios. Testing 57 LLMs reveals that none exceed 15% accuracy, exposing significant gaps in current models' agentic capabilities when facing messy, multi-turn user interactions rather than simplified synthetic tasks.
AI · Bullish · arXiv – CS AI · Apr 6 · 7/10
🧠Researchers introduce IMAgent, an open-source visual AI agent trained with reinforcement learning to handle multi-image reasoning tasks. The system addresses limitations of current VLM-based agents that only process single images, using specialized tools for visual reflection and verification to maintain attention on image content throughout inference.
🏢 OpenAI · 🧠 o1 · 🧠 o3
AI · Neutral · arXiv – CS AI · Mar 17 · 7/10
🧠Researchers introduce CCTU, a new benchmark for evaluating large language models' ability to use tools under complex constraints. The study reveals that even state-of-the-art LLMs achieve less than 20% task completion rates when strict constraint adherence is required, with models violating constraints in over 50% of cases.
AI · Bullish · arXiv – CS AI · Mar 17 · 7/10
🧠Researchers introduce AutoTool, a new reinforcement learning approach that enables AI agents to automatically scale their reasoning capabilities for tool use. The method uses entropy-based optimization and supervised fine-tuning to help models efficiently determine appropriate thinking lengths for simple versus complex problems, achieving 9.8% accuracy improvements while reducing computational overhead by 81%.
AI · Bullish · arXiv – CS AI · Mar 11 · 7/10
🧠AlphaApollo is a new AI reasoning system that addresses limitations in foundation models through multi-turn agentic reasoning, learning, and evolution components. The system demonstrates significant performance improvements across math reasoning benchmarks, with success rates exceeding 85% for tool calls and substantial gains from reinforcement learning across different model scales.
AI · Bullish · arXiv – CS AI · Feb 27 · 7/10
🧠Researchers introduce OmniGAIA, a comprehensive benchmark for evaluating omni-modal AI agents that can process video, audio, and image data simultaneously with complex reasoning capabilities. They also propose OmniAtlas, a foundation agent that enhances existing open-source models' ability to use tools across multiple modalities, marking progress toward more capable AI assistants.
AI · Neutral · arXiv – CS AI · Feb 27 · 7/10
🧠Researchers introduce Tool Decathlon (Toolathlon), a comprehensive benchmark for evaluating AI language agents across 32 software applications and 604 tools in realistic, multi-step scenarios. The benchmark reveals significant limitations in current AI models, with the best performer (Claude-4.5-Sonnet) achieving only a 38.6% success rate on complex, real-world tasks.
AI · Neutral · arXiv – CS AI · 3d ago · 6/10
🧠Researchers demonstrate that jointly training language models for both reasoning and tool-use in agentic RL creates measurable performance interference. They introduce DART, a framework that decouples these capabilities through separate low-rank adaptation modules, achieving superior results across thirteen benchmarks and approaching theoretical performance limits.