When Web Agents Finish but Still Fail: Reproducible Triggers and Trace Diagnostics for Parallel Web Exploration
Researchers introduce Parallel WebBench, a benchmark that reveals critical failure modes in which long-horizon web agents produce confident but incomplete answers. Although GRPO training on synthetic data significantly improves completion rates, agents still struggle with evidence grounding and synthesis accuracy, exposing a gap between appearing to succeed and actually solving the task correctly.