AINeutralarXiv โ CS AI ยท 5h ago2
๐ง
ASTRA-bench: Evaluating Tool-Use Agent Reasoning and Action Planning with Personal User Context
Researchers released ASTRA-bench, a new benchmark for evaluating AI agents' ability to handle complex, multi-step reasoning with personal context and tool usage. Testing revealed that current state-of-the-art models like Claude-4.5-Opus and DeepSeek-V3.2 show significant performance degradation in high-complexity scenarios.