AI · Neutral · arXiv – CS AI · 3h ago · 7/10
A Matter of TASTE: Improving Coverage and Difficulty of Agent Benchmarks
Researchers introduce TASTE, an automated method for generating challenging AI agent benchmarks that reverses traditional task construction: it starts from tool sequences rather than natural-language descriptions. The resulting τc-Bench significantly increases difficulty and tool-use diversity, showing that high performance on existing, saturated benchmarks such as τ2-Bench does not guarantee robust agent capabilities.