AI · Bearish · arXiv – CS AI · 8h ago · 6/10
EHR-Complex: Benchmarking Medical Agents for Complex Clinical Reasoning
Researchers introduce EHR-Complex, a large-scale benchmark of 52K tasks for evaluating AI clinical agents on real-world electronic health record analysis. Testing reveals significant limitations. Top models achieve only 62.3% accuracy, and the errors fall into three dominant failure modes: SQL logic errors, medical code lookup failures, and semantic misunderstandings.