AI · Bearish · arXiv – CS AI · 7h ago · 7/10
MedCUA-Bench: A Screenshot-Only Benchmark for Clinical Computer-Use Agents
Researchers introduced MedCUA-Bench, a new screenshot-only benchmark for evaluating AI agents that perform clinical computer tasks across 18 medical scenarios. The benchmark reveals significant performance gaps: top closed-source models achieve only 54.2% success, and open-source agents average just 2.5%. The results suggest that current AI systems are not yet ready for reliable medical software automation.