arXiv · CS AI
From Conversation to Query Execution: Benchmarking User and Tool Interactions for EHR Database Agents
Researchers introduced EHR-ChatQA, a new benchmark for testing AI agents that interact with Electronic Health Record databases through natural language queries. The benchmark reveals significant reliability gaps in current state-of-the-art LLMs, with success rates dropping substantially when consistency across multiple trials is required.