AI · Neutral · arXiv – CS AI · May 7 · 7/10
🧠Researchers developed and validated the first FMECA (Failure Mode, Effects, and Criticality Analysis) framework to systematically assess patient safety risks in clinical summaries generated by large language models. Testing with GPT-OSS 120B on real hospital discharge summaries demonstrated moderate-to-substantial inter-rater agreement and identified 14 distinct failure modes, establishing a reproducible methodology for evaluating AI-generated clinical content safety.
AI · Bullish · arXiv – CS AI · May 1 · 7/10
🧠CareGuardAI is a safety framework designed to mitigate clinical risks and hallucinations in patient-facing medical LLMs through dual risk assessment mechanisms. The system employs context-aware multi-agent guardrails that evaluate both clinical safety and factual reliability before releasing responses, outperforming GPT-4o-mini on specialized healthcare benchmarks.
AI · Bearish · arXiv – CS AI · Apr 6 · 7/10
🧠A research paper examines reliability issues in AI-assisted medication decision systems, finding that even systems with good aggregate performance can produce dangerous errors in real-world healthcare scenarios. The study emphasizes that single incorrect AI recommendations in medication management can cause severe patient harm, highlighting the need for human oversight and risk-aware evaluation approaches.
AI · Neutral · arXiv – CS AI · 4d ago · 6/10
🧠Researchers introduce EHR-ReasonCon, a benchmark dataset, and EHR-Inspector, an LLM-based framework, both designed to verify consistency between unstructured clinical notes and structured data in Electronic Health Records. The work addresses a critical gap in healthcare data quality by moving beyond simple value matching to capture clinical reasoning, temporal relationships, and event interpretations that reflect real-world documentation practices.
AI · Bullish · arXiv – CS AI · May 11 · 6/10
🧠Researchers demonstrate that automated evaluation metrics can reliably assess AI-generated responses to patient hospitalization questions, matching human expert ratings across 2,800 responses from 28 AI systems. This approach addresses the scalability limitations of manual expert review while maintaining accuracy across three key dimensions: question answering, clinical evidence use, and medical knowledge application.
AI · Neutral · arXiv – CS AI · May 9 · 6/10
🧠Researchers systematically evaluated large language models against supervised BERT models for extracting post-discharge clinical actions from narrative hospital notes. LLMs matched or exceeded supervised baselines on binary actionability detection but lagged on fine-grained multi-label classification, revealing that performance gaps stem from misalignment between model reasoning and annotation conventions rather than pure capability limitations.
AI · Bearish · crypto.news · Apr 11 · 6/10
🧠Maine and Missouri are advancing legislative bans on AI therapy chatbots, reflecting growing state-level regulatory skepticism toward AI-driven mental health services. This trend signals potential restrictions on a developing sector, though the movement remains fragmented across individual states without federal coordination.
AI · Bearish · The Register – AI · Mar 4 · 7/10
🧠Research reveals that AI-powered medical assistant systems can be easily manipulated to change prescriptions and provide harmful medical advice. The study highlights significant vulnerabilities in AI healthcare tools that could pose serious risks to patient safety.
AI · Bearish · arXiv – CS AI · Mar 2 · 7/10
🧠Researchers propose a new risk-sensitive framework for evaluating AI hallucinations in medical advice that considers potential harm rather than just factual accuracy. The study reveals that AI models with similar performance show vastly different risk profiles when generating medical recommendations, highlighting critical safety gaps in current evaluation methods.
AI · Bearish · arXiv – CS AI · Feb 27 · 6/10
🧠Researchers developed ClinDet-Bench, a new benchmark that reveals large language models fail to properly identify when they have sufficient information to make clinical decisions. The study shows LLMs make both premature judgments and excessive abstentions in medical scenarios, highlighting safety concerns for AI deployment in healthcare settings.