y0news
#multi-step-tasks
3 articles
AI · Neutral · arXiv – CS AI · Feb 27 · 7/10 · 3

The Tool Decathlon: Benchmarking Language Agents for Diverse, Realistic, and Long-Horizon Task Execution

Researchers introduce the Tool Decathlon (Toolathlon), a benchmark for evaluating AI language agents across 32 software applications and 604 tools in realistic, multi-step scenarios. The benchmark reveals significant limitations in current AI models: the best performer (Claude-4.5-Sonnet) achieves only a 38.6% success rate on complex, real-world tasks.

AI · Bullish · Google DeepMind Blog · Oct 23 · 7/10 · 6

Gemini Robotics 1.5 brings AI agents into the physical world

Gemini Robotics 1.5 introduces AI agents that operate in physical environments, enabling robots to perceive, plan, think, use tools, and act autonomously. It extends AI beyond digital interfaces to complex multi-step tasks in the real world.

AI · Bullish · OpenAI News · Feb 2 · 6/10 · 5

Introducing deep research

OpenAI has launched deep research, an AI agent that synthesizes large amounts of online information and completes complex multi-step research tasks through advanced reasoning. It is currently available to Pro users, with rollout planned for Plus and Team subscribers.