#clinical-validation News & Analysis

41 articles tagged with #clinical-validation. AI-curated summaries with sentiment analysis and key takeaways from 50+ sources.

AI · Bullish · arXiv – CS AI · Jun 10 · 7/10

FADA: Accessible fetal ultrasound interpretation and annotation with a selectively distilled unified vision-language model

FADA is a unified vision-language model that performs fetal ultrasound interpretation, detection, and segmentation through a single pipeline, addressing critical diagnostic gaps in low- and middle-income countries where sonographer shortages limit prenatal screening. The system runs on consumer hardware and smartphones entirely offline, achieving clinically validated performance metrics while requiring no external labels at inference.

AI · Bullish · arXiv – CS AI · Jun 9 · 7/10

BCG-FM: A Foundation Model for Ambient Cardiac Health Sensing

Researchers introduce BCG-FM, a foundation model trained on 2.75 million hours of ballistocardiography data from nearly 146,000 individuals, enabling non-invasive cardiac health monitoring through piezoelectric bed sensors. The model achieves state-of-the-art biological age estimation and demonstrates clinical relevance across multiple health conditions without requiring deliberate user action.

AI · Bullish · arXiv – CS AI · Jun 9 · 7/10

A multi-agent system for spine MRI report generation from multi-sequence imaging

SpineAgent is a multi-agent AI framework that generates clinical spine MRI reports by processing multi-sequence imaging data from over 32,000 patients. The system combines specialized deep learning encoders with a medical report agent to achieve state-of-the-art performance in automated radiology report generation while maintaining cross-manufacturer compatibility.

AI · Bearish · arXiv – CS AI · Jun 8 · 7/10

When Large Language Models Fail in Healthcare: Evaluating Sensitivity to Prompt Variations

A comprehensive study reveals that both general-purpose and medical-specific large language models exhibit dangerous sensitivity to prompt variations, with even minor rewording capable of altering clinical diagnoses or producing harmful medical advice. The research demonstrates that adversarial manipulations can trigger clinically dangerous outputs such as incorrect dosages, raising serious safety concerns for healthcare AI deployment.

🧠 Llama

AI · Bearish · arXiv – CS AI · Jun 2 · 7/10

A Multi-Domain Red Teaming Framework for Safety, Robustness, and Fairness Evaluation of Medical Large Language Models

Researchers developed a comprehensive red teaming framework to evaluate 11 major LLMs across 690 clinically grounded scenarios, revealing that aggregate accuracy scores mask critical safety failures in medical AI systems. The study found that high-performing models (scoring 0.97+) still exhibited complete failures in individual safety-critical cases, and equity-related tasks showed 10-20% error amplification with demographic modifications.

🧠 GPT-5 · 🧠 Claude · 🧠 Opus

AI · Bullish · arXiv – CS AI · Jun 2 · 7/10

Towards a General Intelligence and Interface for Wearable Health Data

Researchers have developed a foundation model for wearable health data trained on over one trillion minutes of sensor signals from five million participants. The model demonstrates strong performance across 35 health prediction tasks and enables few-shot learning and personalized health insights through integration with LLM agents, validated by clinician feedback.

AI · Bearish · arXiv – CS AI · Jun 2 · 7/10

Food Noise & False Safety: A Systematic Evaluation of How LLMs Fail to Adapt to Eating Disorder Queries with Clinician Feedback

A new research paper demonstrates that large language models fail to adequately safeguard users with eating disorders, instead uncritically adapting to and facilitating potentially harmful requests. The study, conducted with clinical eating-disorder experts, identifies specific linguistic cues that increase unsafe responses and reveals systematic gaps in how LLMs handle vulnerable populations seeking mental health support.

AI · Bearish · arXiv – CS AI · Jun 2 · 7/10

CardioLens: Revealing the Clinical Reality Gap of MLLMs via Multi-Sequence Cardiac MRI Evaluations

Researchers introduce CardioLens, a rigorous evaluation framework revealing that state-of-the-art multimodal large language models (MLLMs) perform poorly at clinical cardiac MRI interpretation despite strong public benchmark results. The study demonstrates a significant gap between theoretical capabilities and real-world clinical applicability, with models failing to integrate distributed evidence across imaging sequences and temporal phases.

AI · Bullish · arXiv – CS AI · May 28 · 7/10

Clinical Validation of the Melanoscope AI Mobile Dermoscopy Clinical Decision Support System

Researchers validated the Melanoscope AI clinical decision support system for skin lesion screening in Russian outpatient settings, achieving 88.6% agreement with expert assessment and zero false negatives among malignant cases. The study introduces quantitative interpretability methods for deep learning models and a three-zone patient routing algorithm, demonstrating the viability of AI-powered dermoscopy as a scalable solution to address dermatologist shortages.

AI · Neutral · arXiv – CS AI · May 12 · 7/10

Mental Health AI Safety Claims Must Preserve Temporal Evidence

Researchers argue that current mental health AI safety evaluations fail to detect clinically significant failures because they assess isolated responses rather than temporal patterns across conversations. The paper introduces Temporal Safety Non-Identifiability to formalize why sequence-dependent failures cannot be certified by turn-level evaluations, proposing SCOPE-MH as a new evaluation standard that preserves conversation history and cumulative effects.

AI · Bearish · arXiv – CS AI · May 12 · 7/10

Measuring What Matters: Benchmarking Generative, Multimodal, and Agentic AI in Healthcare

A new research paper highlights a critical gap in AI healthcare benchmarking: frontier models score near-perfect on medical licensing exams but significantly underperform on real clinical tasks like documentation (0.74–0.85), clinical decision support (0.61–0.76), and administrative workflows (0.53–0.63). The study argues that current benchmarks measure knowledge rather than reliability and safety in complex, high-stakes clinical environments, creating a false sense of deployment readiness.

AI · Bullish · arXiv – CS AI · May 12 · 7/10

Biosignal Fingerprinting: A Cross-Modal PPG-ECG Foundation Model

Researchers have developed M2AE, a cross-modal foundation model trained on 3.4 million paired ECG and PPG signals that creates compact 'biosignal fingerprints' for cardiovascular monitoring. These privacy-preserving representations enable accurate disease detection and risk prediction across multiple clinical tasks while functioning with single-sensor wearables, addressing the scalability gap between diagnostic-grade ECG and ubiquitous PPG sensors.

AI · Neutral · arXiv – CS AI · May 7 · 7/10

Evaluating Patient Safety Risks in Generative AI: Development and Validation of a FMECA Framework for Generated Clinical Content

Researchers developed and validated the first FMECA (Failure Mode, Effects, and Criticality Analysis) framework to systematically assess patient safety risks in clinical summaries generated by large language models. Testing with GPT-OSS 120B on real hospital discharge summaries demonstrated moderate-to-substantial inter-rater agreement and identified 14 distinct failure modes, establishing a reproducible methodology for evaluating AI-generated clinical content safety.

AI · Bearish · arXiv – CS AI · May 7 · 7/10

Are Multimodal LLMs Ready for Clinical Dermatology? A Real-World Evaluation in Dermatology

A comprehensive study evaluating five multimodal large language models (MLLMs) on real-world dermatology tasks reveals a significant gap between benchmark performance and clinical applicability. While models achieved up to 42% accuracy on public datasets, performance dropped dramatically to 1.5-24.65% on actual hospital cases, highlighting critical limitations in deploying these systems for clinical decision-making.

🧠 GPT-4

AI · Bullish · arXiv – CS AI · May 1 · 7/10

End-to-End Evaluation and Governance of an EHR-Embedded AI Agent for Clinicians

Researchers present a comprehensive governance framework for deployed clinical AI systems, demonstrated through Hyperscribe, an EHR-embedded audio transcription agent. The study shows that continuous monitoring, controlled experimentation, and multi-channel feedback mechanisms can improve system performance from 84% to 95% accuracy while maintaining operational efficiency and cost-effectiveness.

AI · Bullish · arXiv – CS AI · Apr 20 · 7/10

DeepER-Med: Advancing Deep Evidence-Based Research in Medicine Through Agentic AI

Researchers introduce DeepER-Med, an agentic AI framework designed to advance evidence-based medical research with explicit transparency and trustworthiness mechanisms. The system outperforms existing production-grade platforms on complex medical questions and demonstrates clinical alignment in real-world case evaluations, addressing critical gaps in AI reliability for healthcare adoption.

AI · Bullish · arXiv – CS AI · Apr 10 · 7/10

DosimeTron: Automating Personalized Monte Carlo Radiation Dosimetry in PET/CT with Agentic AI

DosimeTron, an agentic AI system powered by GPT-5.2, automates personalized Monte Carlo radiation dosimetry calculations for PET/CT medical imaging. Validated on 597 studies across 378 patients, the system achieved 99.6% correlation with reference dosimetry calculations while processing each case in approximately 32 minutes with zero execution failures.

🧠 GPT-5

AI · Neutral · arXiv – CS AI · Apr 6 · 7/10

Using LLM-as-a-Judge/Jury to Advance Scalable, Clinically-Validated Safety Evaluations of Model Responses to Users Demonstrating Psychosis

Researchers developed a scalable method using LLMs as judges to evaluate AI safety for users with psychosis, finding strong alignment with human clinical consensus. The study addresses critical risks of LLMs potentially reinforcing delusions in vulnerable mental health populations through automated safety assessment.

AI · Neutral · arXiv – CS AI · Jun 25 · 5/10

Phoneme-Level Mispronunciation Screening in Polish-Speaking Children with an Explainable Assistant

Researchers developed an AI-powered screening tool that detects speech sound errors in Polish-speaking children, using a wav2vec2 model to identify sibilant substitutions. Designed as a lightweight alternative to specialist evaluation for early intervention, the system achieves 88.7% accuracy on a test set, with 72.9% precision and a 2.7% false-alarm rate.

AI · Neutral · arXiv – CS AI · Jun 25 · 6/10

BCoughBench: Benchmarking Respiratory Acoustic Foundation Models Under Body-Coupled Wearable Sensor Conditions

BCoughBench introduces a standardized evaluation framework for respiratory acoustic foundation models deployed on body-coupled wearable sensors, revealing significant performance degradation compared to smartphone recordings. The study demonstrates that existing models fail to meet clinical thresholds for disease detection when adapted to wearable conditions, though demographic tasks like age regression remain robust.

AI · Neutral · arXiv – CS AI · Jun 23 · 6/10

DBT-Bleed: Dual-Branch Temporal Modeling with Key-Frame Selection for Surgical Bleeding Detection

Researchers introduce DBT-Bleed, an AI framework for detecting intraoperative bleeding during surgery by using dual-branch temporal modeling and intelligent frame selection. The system significantly outperforms existing methods on bleeding detection while demonstrating cross-procedure generalization capabilities, alongside a new neurosurgery dataset for adverse event research.

AI · Bullish · arXiv – CS AI · Jun 19 · 6/10

Modeling Day-Long ECG Signals to Predict Heart Failure Risk with Explainable AI

Researchers developed DeepHHF, a deep learning model trained on 24-hour ECG recordings that predicts five-year heart failure risk with an AUC of 0.80, outperforming traditional 30-second ECG analysis and clinical scoring systems. The model identified high-risk patients with a two-fold higher risk of hospitalization or death, demonstrating that continuous cardiac monitoring combined with explainable AI offers a non-invasive, cost-effective approach to preventive healthcare.

AI · Neutral · arXiv – CS AI · Jun 10 · 6/10

Deep Slice Interpolation for Reducing Through-Plane Anisotropy and Noise in Head CT

Researchers have developed a deep learning system that synthesizes intermediate CT slices to reduce through-plane anisotropy in head CT imaging, effectively halving spacing while simultaneously denoising outputs. The system outperforms classical interpolation and existing video frame interpolation methods, with MS-SSIM+L1 loss providing optimal performance across structural measures.

AI · Neutral · arXiv – CS AI · Jun 10 · 6/10

Exploration of Foundation Model-Based Robots in Patient and Elderly Care

A research perspective examines how foundation models are being integrated into care robots for elderly and patient assistance, finding that while these systems show promise in engagement and usability, they suffer from reliability issues and lack evidence of meaningful clinical outcomes. The study emphasizes the need for care-specific evaluation standards and accountable autonomy before these technologies can be responsibly deployed in healthcare workflows.

AI · Neutral · arXiv – CS AI · Jun 9 · 6/10

NeuroAlign: Hierarchical Multimodal Fusion of Dynamic and Structural Neuroimaging for MCI Analysis

NeuroAlign presents a hierarchical machine learning framework that fuses functional MRI and diffusion tensor imaging data to improve detection of mild cognitive impairment. The system introduces novel alignment and interaction mechanisms between multimodal neuroimaging datasets, with a new attribution method for interpretability, demonstrating competitive results across multiple medical imaging datasets.
