HealthAdminBench: Evaluating Computer-Use Agents on Healthcare Administration Tasks
Researchers introduced HealthAdminBench, a new evaluation framework with 135 tasks spanning realistic healthcare administration workflows, revealing that current AI agents achieve only 36.3% end-to-end success despite strong performance on individual subtasks. The benchmark demonstrates a critical gap between AI capabilities and the reliability requirements for automating healthcare administrative processes, which account for over $1 trillion in annual spending.
HealthAdminBench addresses a significant blind spot in AI agent evaluation by focusing on healthcare administration rather than clinical applications. The benchmark's GUI environments, including EHR systems, payer portals, and fax platforms, reflect the messy reality of administrative work, decomposing 135 tasks into 1,698 verifiable subtasks. This granular approach reveals a crucial insight: agents handle individual steps well (GPT-5.4 achieves 82.8% subtask success) but fail at orchestrating complete workflows, with the best performer reaching only 36.3% end-to-end success.
This discrepancy matters because healthcare administration is notoriously labor-intensive and error-prone, representing over $1 trillion in annual spending. Current LLM-based agents show promise for reducing administrative burden but lack the reliability needed for real-world deployment, where mistakes carry compliance and patient-care implications. The benchmark quantifies what practitioners have long suspected: end-to-end reliability is substantially harder to achieve than isolated task performance.
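The size of the gap is roughly what a simple compounding model predicts. If each of a workflow's roughly 12.6 subtasks (1,698 / 135) had to succeed independently at the reported 82.8% rate, whole workflows would succeed only about 9% of the time. The independence assumption below is ours, not the paper's; this is a back-of-envelope sketch:

```python
# Back-of-envelope model of how per-subtask errors compound
# across a multi-step workflow (independence is an assumption).
SUBTASK_SUCCESS = 0.828   # reported per-subtask success rate
TASKS = 135               # end-to-end tasks in the benchmark
SUBTASKS = 1698           # verifiable subtasks across all tasks

avg_steps = SUBTASKS / TASKS                  # ~12.6 subtasks per task
e2e_predicted = SUBTASK_SUCCESS ** avg_steps  # every step must succeed

print(f"avg subtasks per task: {avg_steps:.1f}")
print(f"predicted end-to-end success: {e2e_predicted:.1%}")  # ~9%
```

That the best agent reaches 36.3% rather than ~9% suggests errors are correlated or partially recoverable in practice; either way, compounding per-step failure is the core obstacle the benchmark surfaces.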
For the AI development community, HealthAdminBench establishes a rigorous testing ground that could accelerate progress toward production-ready agents. Healthcare systems exploring AI automation will likely reference these metrics when evaluating whether current solutions merit implementation. The research suggests that breakthroughs in task planning, error recovery, and multi-step reasoning remain necessary before widespread deployment.
Looking forward, developers will likely focus on closing the gap between subtask and end-to-end performance through improved prompting strategies, better state management, and more robust error handling. Subsequent iterations of this benchmark may well become an industry standard for evaluating administrative automation solutions.
- Current best-performing AI agents achieve only 36.3% end-to-end success on healthcare administrative tasks despite 82.8% subtask accuracy, exposing a critical reliability gap.
- HealthAdminBench provides 1,698 evaluation points across realistic healthcare workflows including EHR, payer portals, and fax systems.
- The benchmark reveals that handling individual steps differs fundamentally from orchestrating complete multi-step workflows in complex domain environments.
- Healthcare administration's $1 trillion annual spending makes administrative automation a high-value but high-stakes application for AI agents.
- This research establishes a foundation for measuring progress toward safe, reliable healthcare administrative automation over the coming years.