A Matter of TASTE: Improving Coverage and Difficulty of Agent Benchmarks
Researchers introduce TASTE, an automated method for generating challenging AI agent benchmarks that reverses traditional task construction by starting from tool sequences rather than natural language descriptions. The resulting τc-Bench is markedly harder and demands more diverse tool use, showing that high scores on saturated benchmarks such as τ2-Bench do not guarantee robust agent capabilities.
Advances in AI agent capabilities have exposed a critical weakness in current evaluation methodology: existing benchmarks saturate, masking the true limitations of state-of-the-art models. TASTE addresses this by automating benchmark generation with an Adaptive Contrastive n-gram model that samples valid tool sequences and then clusters them so the selected tasks are representative. Reversing the traditional task-construction pipeline, which typically begins with natural language scenarios, captures a substantially broader range of tool-interaction patterns that agents must navigate.
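The sample-then-select idea can be illustrated with a deliberately simplified sketch. It is not the TASTE algorithm: it fits a plain bigram model, omits the adaptive and contrastive components entirely, and replaces clustering with a greedy coverage heuristic. The tool names and the helpers `fit_bigram`, `sample_sequence`, and `pick_representatives` are hypothetical and exist only for this example.

```python
import random
from collections import Counter, defaultdict

START, END = "<s>", "</s>"  # hypothetical boundary markers

def fit_bigram(sequences):
    """Count tool-to-tool transitions observed in the seed sequences."""
    counts = defaultdict(Counter)
    for seq in sequences:
        path = [START, *seq, END]
        for prev, nxt in zip(path, path[1:]):
            counts[prev][nxt] += 1
    return counts

def sample_sequence(counts, rng, max_len=8):
    """Walk the transition table from START until END or max_len tools."""
    seq, prev = [], START
    while len(seq) < max_len:
        options = counts[prev]
        nxt = rng.choices(list(options), weights=list(options.values()))[0]
        if nxt == END:
            break
        seq.append(nxt)
        prev = nxt
    return seq

def pick_representatives(candidates, k):
    """Greedy stand-in for clustering (an assumption, not the paper's method):
    repeatedly keep the candidate that adds the most unseen tool pairs."""
    pool = sorted({tuple(c) for c in candidates if c})
    chosen, covered = [], set()
    while pool and len(chosen) < k:
        best = max(pool, key=lambda s: len(set(zip(s, s[1:])) - covered))
        chosen.append(best)
        covered |= set(zip(best, best[1:]))
        pool.remove(best)
    return chosen

# Hypothetical seed trajectories; real ones would come from a tool environment.
seed = [
    ["get_user", "get_order", "cancel_order"],
    ["get_user", "get_order", "refund_order"],
    ["search_flights", "book_flight"],
]
rng = random.Random(0)
model = fit_bigram(seed)
samples = [sample_sequence(model, rng) for _ in range(50)]
tasks = pick_representatives(samples, k=3)
```

Each selected tool sequence would then be turned into a natural language task, which is the reversal the article describes. In this sketch, "valid" only means that every consecutive pair of tools was observed in the seed data; a real generator would presumably also need to check tool preconditions and arguments.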
The performance drop from τ2-Bench to τc-Bench is striking. Models such as Gemini-3-Flash, which achieve 82-94% accuracy on the original benchmark, fall to 28-61% on the new tasks. This suggests that benchmark saturation has created a false sense of progress, with high scores reflecting the narrowness of existing evaluation protocols rather than genuine robustness. The doubling of unique tool combinations required by TASTE-generated tasks makes for a more comprehensive stress test of agent capabilities across diverse scenarios.
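One way to make "unique tool combinations" concrete is to count the distinct sets of tools that tasks require. The sketch below assumes a combination is the unordered set of tools in a task, ignoring order and repeats; the benchmark's actual definition may differ. The task lists and the helper `unique_tool_combinations` are hypothetical.

```python
def unique_tool_combinations(tasks):
    """Return the distinct tool sets used across tasks (order and repeats ignored)."""
    return {frozenset(task) for task in tasks}

# Hypothetical benchmarks: the first reuses one combination, the second varies it.
narrow = [
    ["get_user", "get_order"],
    ["get_order", "get_user"],
    ["get_user", "get_order", "get_order"],
]
broad = [
    ["get_user", "get_order"],
    ["get_user", "cancel_order"],
    ["search_flights", "book_flight"],
]

ratio = len(unique_tool_combinations(broad)) / len(unique_tool_combinations(narrow))
print(ratio)
```

Under this definition, a benchmark can contain many tasks yet exercise only a handful of combinations, which is the kind of narrowness the comparison above points to.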
For the AI development community, this work highlights an urgent problem: as agents improve, evaluation infrastructure must keep pace to remain meaningful. Because TASTE is automated and scalable, it is a candidate for continuous benchmarking. That gives researchers a way to calibrate agent performance accurately and to compare models on genuinely challenging problems. The methodology's emphasis on tool-use diversity also reflects real-world deployment, where agents must handle unpredictable combinations of capabilities rather than narrow, well-practiced scenarios.
- TASTE automates benchmark generation by starting from sampled tool sequences rather than natural language task descriptions, capturing a broader range of agent capabilities.
- Top-performing models fall from 82-94% accuracy on τ2-Bench to 28-61% on τc-Bench, indicating that saturation had masked their true limitations.
- The new τc-Bench more than doubles the unique tool combinations agents must execute, providing genuinely challenging evaluation coverage.
- Automated, scalable benchmark generation enables continuous evaluation of advancing agent capabilities without labor-intensive manual task construction.
- High benchmark scores may reflect evaluation saturation rather than robust problem-solving ability, which calls for methodological changes in agent assessment.