AINeutralarXiv โ CS AI ยท 7h ago6/10
๐ง
AgentProcessBench: Diagnosing Step-Level Process Quality in Tool-Using Agents
Researchers introduce AgentProcessBench, the first benchmark for evaluating step-level effectiveness in AI tool-using agents, comprising 1,000 trajectories and 8,509 human-labeled annotations. The benchmark reveals that current AI models struggle with distinguishing neutral and erroneous actions in tool execution, and that process-level signals can significantly enhance test-time performance.