#clinical-ai News & Analysis

131 articles tagged with #clinical-ai. AI-curated summaries with sentiment analysis and key takeaways from 50+ sources.


AI · Neutral · arXiv – CS AI · Jun 1 · 7/10

EHRBench: An Automated and Reliable EHR-based Benchmark for Clinical Decision Making with LLMs

Researchers introduce EHRBench, an automated benchmark containing nearly 1 million QA items derived from real patient electronic health records to evaluate large language models on clinical decision-making tasks. The framework combines LLM-based template generation with knowledge-base verification to assess model performance on diagnosis, treatment, and prognosis at scale while maintaining reliability.

AI · Neutral · arXiv – CS AI · May 29 · 7/10

MedCase-Structured: A Text-to-FHIR Dataset for Benchmarking Diagnostic Reasoning in Clinically Realistic EHR Settings

Researchers introduced MedCase-Structured, a synthetic dataset that converts unstructured clinical text into standardized HL7 FHIR format for evaluating large language models in realistic healthcare settings. The study reveals that LLMs perform significantly worse on structured clinical data than plain text, highlighting a critical gap between academic benchmarks and real-world deployment requirements.

AI · Bullish · arXiv – CS AI · May 29 · 7/10

GRASP: Gated Regression-Aware Skill Proposer for Self-Improving LLM Agents

Researchers introduce GRASP, a method for improving large language model agents through controlled skill library updates that prevent performance regression. Tested across five base models on clinical benchmarks, GRASP achieves dramatic improvements (40.6% to 88.8% on MedAgentBench) while maintaining stability, outperforming existing self-improvement approaches by significant margins.

🧠 GPT-4 · 🧠 GPT-5 · 🧠 Gemini

AI · Bullish · arXiv – CS AI · May 29 · 7/10

ProtoMedAgent: Multimodal Clinical Interpretability via Privacy-Aware Agentic Workflows

ProtoMedAgent introduces a framework that combines interpretable prototype networks with privacy-aware AI workflows to generate clinically accurate medical reports without the hallucination issues common in standard RAG systems. The approach achieves 91.2% faithfulness in clinical documentation while protecting patient privacy through k-anonymity and ℓ-diversity constraints.

AI · Bullish · arXiv – CS AI · May 29 · 7/10

ConceptM³oE: Concept-Guided Multimodal Mixture of Experts for Interpretable Computational Pathology

ConceptM³oE introduces a novel AI architecture that combines multimodal mixture-of-experts with interpretable concept bottlenecks for computational pathology, enabling medical AI models to provide transparent reasoning while maintaining competitive performance. The framework improves diagnostic accuracy in data-limited scenarios and demonstrates practical alignment with clinical decision-making processes.

AI · Bullish · arXiv – CS AI · May 29 · 7/10

Small Agent Group is the Future of Digital Health

Researchers propose Small Agent Group (SAG), a collaborative multi-agent approach to clinical AI that outperforms single large language models while reducing deployment costs and improving reliability. The study challenges the prevailing 'scaling-first' philosophy in digital health, suggesting that distributed reasoning across specialized agents can achieve superior clinical outcomes more efficiently.

AI · Bullish · arXiv – CS AI · May 28 · 7/10

Comparative Analysis of Liquid Neural Networks and LSTM for Sequential Pattern Recognition: Robustness, Efficiency, and Clinical Utility

Researchers benchmark Liquid Neural Networks (LNNs) against traditional LSTMs across four sequential data domains, finding that LNNs deliver superior parameter efficiency and robustness on sparse temporal data, which is particularly valuable for clinical applications. The study shows that the LNNs' continuous-time modeling approach outperforms discrete-step RNNs when data is missing or irregularly sampled, with implications for real-world AI deployment in healthcare and edge computing.

AI · Bullish · arXiv – CS AI · May 28 · 7/10

Localizing Input Uncertainty Quantification for Large Language Models via Shapley Values

Researchers introduce ShaQ, a Shapley-value-based framework that identifies which specific parts of user input cause uncertainty in large language models, rather than just flagging overall uncertainty. The method achieves state-of-the-art ambiguity detection on multiple benchmarks and demonstrates practical value in high-stakes domains like clinical settings by enabling targeted input clarification.

AI · Bearish · arXiv – CS AI · May 27 · 7/10

GlobalDentBench: A Multinational Benchmark for Evaluating LLM Clinical Reasoning in Dentistry with Expert Calibration

GlobalDentBench introduces the first multinational dental benchmark with 8,978 expert-validated questions across 14 specialties, revealing that current LLMs face severe limitations in clinical reasoning with a 31.01% unsafe recommendation rate. The study demonstrates performance degrades sharply as reasoning complexity increases, with accuracy dropping from 81.34% on multiple-choice to just 22.34% on case-based questions, highlighting critical safety gaps before LLMs can be deployed in healthcare.

AI · Bullish · arXiv – CS AI · May 27 · 7/10

MedVol-R1: Reward-Driven Evidence Grounding for Volumetric Reasoning Segmentation

MedVol-R1 introduces a reinforcement learning framework for volumetric reasoning segmentation in 3D medical imaging, decoupling evidence grounding from mask generation to improve interpretability and accuracy. The system uses an LVLM to identify key 2D evidence anchors before propagating them into coherent 3D segmentations, achieving state-of-the-art results on multiple medical imaging benchmarks without requiring expensive annotations.

AI · Bullish · arXiv – CS AI · May 12 · 7/10

EpiGraph: A Knowledge Graph and Benchmark for Evidence-Intensive Reasoning in Epilepsy

Researchers have developed EpiGraph, a comprehensive knowledge graph containing 24,324 entities and 32,009 evidence-grounded triplets from 48,166 peer-reviewed papers to improve AI-driven epilepsy diagnosis and treatment. The accompanying EpiBench benchmark demonstrates that integrating structured clinical knowledge into large language models significantly enhances clinical reasoning, with improvements up to 41% in pharmacogenomic applications.

AI · Bullish · arXiv – CS AI · May 12 · 7/10

CLR-voyance: Reinforcing Open-Ended Reasoning for Inpatient Clinical Decision Support with Outcome-Aware Rubrics

Researchers introduce CLR-voyance, a framework that treats inpatient clinical reasoning as a partially observable decision process with outcome-grounded rewards validated by clinicians. The resulting CLR-voyance-8B model outperforms GPT-5 and larger medical models on clinical benchmarks while maintaining generalist capabilities, and has been deployed in a hospital for six months.

🧠 GPT-5

AI · Bullish · arXiv – CS AI · May 9 · 7/10

A Versatile AI Agent for Rare Disease Diagnosis and Risk Gene Prioritization

Researchers introduced Hygieia, an AI agent system that integrates phenotypic, genetic, and clinical data to diagnose rare diseases and prioritize risk genes. Validated with clinical experts from Yale and Duke-NUS, the system demonstrated 12-60% diagnostic accuracy improvements over physicians and reduced clinician workload in real-world applications.

AI · Bullish · arXiv – CS AI · May 4 · 7/10

Adoption and Use of LLMs at an Academic Medical Center

Researchers at an academic medical center developed ChatEHR, an LLM system integrated into electronic health records that enables both automated clinical tasks and interactive use across patient timelines. Over 1.5 years, the platform achieved adoption by 1,075 users conducting 23,000 sessions, generating an estimated $6M in first-year savings while maintaining vendor-agnostic governance.

AI · Bullish · arXiv – CS AI · May 1 · 7/10

End-to-End Evaluation and Governance of an EHR-Embedded AI Agent for Clinicians

Researchers present a comprehensive governance framework for deployed clinical AI systems, demonstrated through Hyperscribe, an EHR-embedded audio transcription agent. The study shows that continuous monitoring, controlled experimentation, and multi-channel feedback mechanisms can improve system performance from 84% to 95% accuracy while maintaining operational efficiency and cost-effectiveness.

AI · Bullish · arXiv – CS AI · Apr 15 · 7/10

Schema-Adaptive Tabular Representation Learning with LLMs for Generalizable Multimodal Clinical Reasoning

Researchers propose Schema-Adaptive Tabular Representation Learning, which uses LLMs to convert structured clinical data into semantic embeddings that transfer across different electronic health record schemas without retraining. When combined with imaging data for dementia diagnosis, the method achieves state-of-the-art results and outperforms board-certified neurologists on retrospective diagnostic tasks.

AI · Bearish · arXiv – CS AI · Apr 14 · 7/10

Scalable Stewardship of an LLM-Assisted Clinical Benchmark with Physician Oversight

Researchers discovered that at least 27% of labels in MedCalc-Bench, a clinical benchmark partly created with LLM assistance, contain errors or are incomputable. On a physician-reviewed subset, the researchers' corrected labels matched physician ground truth 74% of the time, versus only 20% for the original labels. The finding indicates that LLM-assisted benchmarks can systematically distort AI model evaluation and training without active human oversight.

AI · Bullish · arXiv – CS AI · Mar 27 · 7/10

AD-CARE: A Guideline-grounded, Modality-agnostic LLM Agent for Real-world Alzheimer's Disease Diagnosis with Multi-cohort Assessment, Fairness Analysis, and Reader Study

Researchers developed AD-CARE, an AI agent that uses large language models to diagnose Alzheimer's disease from incomplete medical data across multiple modalities. The system achieved 84.9% diagnostic accuracy across 10,303 cases and improved physician decision-making speed and accuracy in clinical studies.

AI · Neutral · arXiv – CS AI · Mar 17 · 7/10

How Meta-research Can Pave the Road Towards Trustworthy AI In Healthcare: Catalogue of Ideas and Roadmap for Future Research

Researchers convened a February 2025 workshop to explore how meta-research methodologies can enhance Trustworthy AI (TAI) implementation in healthcare. The study identifies key challenges including robustness, reproducibility, clinical integration, and transparency gaps, proposing a roadmap for interdisciplinary collaboration between TAI and meta-research fields.

AI · Bullish · arXiv – CS AI · Mar 11 · 7/10

A prospective clinical feasibility study of a conversational diagnostic AI in an ambulatory primary care clinic

Google's AMIE conversational AI completed a clinical feasibility study with 100 patients at an academic medical center, including the correct diagnosis in 90% of cases and achieving high patient satisfaction. The AI showed diagnostic quality comparable to that of primary care physicians and required no safety interventions during real-world clinical interactions.

AI · Bullish · arXiv – CS AI · Mar 11 · 7/10

Deep Expert Injection for Anchoring Retinal VLMs with Domain-Specific Knowledge

Researchers developed EyExIn, a new AI framework that addresses critical gaps in large vision language models for medical diagnosis by anchoring them with domain-specific expert knowledge. The system uses dual-stream encoding and deep expert injection to improve accuracy in ophthalmic diagnosis, outperforming existing proprietary systems across four benchmarks.

AI · Neutral · arXiv – CS AI · Mar 5 · 7/10

RAG-X: Systematic Diagnosis of Retrieval-Augmented Generation for Medical Question Answering

Researchers propose RAG-X, a diagnostic framework for evaluating retrieval-augmented generation systems in medical AI applications. The study reveals an 'Accuracy Fallacy' showing a 14% gap between perceived system success and actual evidence-based grounding in medical question-answering systems.

AI · Bearish · arXiv – CS AI · Mar 5 · 7/10

SycoEval-EM: Sycophancy Evaluation of Large Language Models in Simulated Clinical Encounters for Emergency Care

Researchers developed SycoEval-EM, a framework that tests how well large language models resist patient pressure for inappropriate medical care in emergency settings. Across 20 LLMs and 1,875 encounters, acquiescence rates ranged from 0% to 100%, and models were more vulnerable to imaging requests than to opioid prescriptions, highlighting the need for adversarial testing in clinical AI certification.

AI · Bullish · arXiv – CS AI · Mar 4 · 7/10 · 3

Guideline-Grounded Evidence Accumulation for High-Stakes Agent Verification

Researchers developed GLEAN, a new AI verification framework that improves reliability of LLM-powered agents in high-stakes decisions like clinical diagnosis. The system uses expert guidelines and Bayesian logistic regression to better verify AI agent decisions, showing 12% improvement in accuracy and 50% better calibration in medical diagnosis tests.

AI · Bullish · arXiv – CS AI · Mar 4 · 7/10 · 2

MedXIAOHE: A Comprehensive Recipe for Building Medical MLLMs

Researchers have released MedXIAOHE, a new medical vision-language AI foundation model that achieves state-of-the-art performance across medical benchmarks and surpasses leading closed-source systems. The model incorporates advanced features like entity-aware pretraining, reinforcement learning for medical reasoning, and evidence-grounded report generation to improve reliability in clinical applications.
