AI · Neutral · arXiv – CS AI · Apr 14 · 7/10
🧠Researchers introduce ClawGuard, a runtime security framework that protects tool-augmented LLM agents from indirect prompt injection attacks by enforcing user-confirmed rules at tool-call boundaries. The framework blocks malicious instructions embedded in tool responses without requiring model modifications, demonstrating robust protection across multiple state-of-the-art language models.
AI · Bullish · arXiv – CS AI · Mar 3 · 7/10 · 4
🧠Researchers introduced AgentMath, a new AI framework that combines language models with code interpreters to solve complex mathematical problems more efficiently than current Large Reasoning Models. The system achieves state-of-the-art performance on mathematical competition benchmarks, with AgentMath-30B-A3B reaching 90.6% accuracy on AIME24 while remaining competitive with much larger models like OpenAI-o3.
AI · Neutral · arXiv – CS AI · Jun 5 · 6/10
🧠Researchers introduce SciVisAgentSkills, a framework of reusable agent capabilities designed to enhance AI coding agents for scientific data visualization tasks across tools like ParaView and napari. Testing on 108 benchmark tasks demonstrates that these domain-specific skills improve agent performance and token efficiency, suggesting that structured procedural knowledge is essential for reliable long-horizon scientific workflows.
🧠 Claude
AI · Neutral · arXiv – CS AI · Jun 3 · 6/10
🧠Researchers introduce ToolGate, a control mechanism that optimizes token efficiency in vision-language agents by intelligently deciding when to execute tool calls versus skip them. The system reduces computational costs to 64-69% of baseline while maintaining accuracy, demonstrating that selective tool usage outperforms indiscriminate execution in AI agents.
AI · Neutral · arXiv – CS AI · May 12 · 6/10
🧠Researchers introduce DeepTumorVQA, a comprehensive benchmark for evaluating medical AI vision-language models on 3D CT tumor analysis through 476K hierarchical questions across four diagnostic stages. The study reveals that measurement accuracy is the critical bottleneck in medical AI reasoning, and that tool-augmented agents significantly outperform models working without external resources.
AI · Neutral · arXiv – CS AI · May 11 · 6/10
🧠Researchers introduce TEA-Bench, the first interactive benchmark for evaluating how external tools improve emotional support conversation (ESC) systems. Testing nine LLMs reveals that tool augmentation reduces hallucination and improves support quality, but its effectiveness depends heavily on model capacity: stronger models leverage tools more effectively than weaker ones.
AI · Bullish · arXiv – CS AI · Apr 7 · 6/10
🧠Researchers introduce Profile-Then-Reason (PTR), a framework for tool-using AI language agents that reduces computational overhead by planning the workflow up front rather than recomputing after each step. The approach caps language model calls at two to three and outperforms reactive execution methods in 16 of 24 test configurations.