AI · Neutral · arXiv – CS AI · 2d ago · 7/10
🧠Researchers introduce DrugBench, a benchmark for evaluating AI safety protocols in medical LLM applications, combining 3,671 medical conversations with FDA drug data to test systems against medication-related harms. The study reveals that existing AI control mechanisms can be circumvented and proposes severity-based monitoring to better account for the potential consequences of unsafe outputs in clinical contexts.
AI · Bearish · arXiv – CS AI · 6d ago · 7/10
🧠Researchers demonstrate that clinical NLP datasets for suicidality detection, particularly the ScAN dataset built on MIMIC-III notes, embed specific operational choices that obscure how labels are constructed, so the labels do not represent objective ground truth. The study shows that dataset design decisions, including single annotators, ICD-based cohort selection, and hospital-stay aggregation, shape what suicidality means in algorithmic systems, highlighting critical gaps between documented clinical judgments and actual suicidal intent.
AI · Bullish · arXiv – CS AI · Jun 9 · 7/10
🧠CARE introduces a conformal safety layer that detects hallucinations and omissions in LLM-generated medical summaries without retraining. The system provides formal, distribution-free guarantees for controlling safety risks while reducing clinician review burden by up to 5x compared to alternative methods.
AI · Bullish · arXiv – CS AI · Jun 5 · 7/10
🧠Researchers demonstrate that Group Relative Policy Optimization (GRPO) combined with a novel Variance-Aware Reward Framework significantly improves smaller LLMs' performance on medical question answering, particularly for heart-related queries. The approach achieves a 38% accuracy improvement on a held-out test set while remaining competitive with much larger models, offering a practical path toward efficient, deployable medical AI systems.
AI · Bullish · arXiv – CS AI · May 29 · 7/10
🧠Researchers introduce HDPO, a method that uses hallucination detectors to guide iterative refinement of AI-generated clinical summaries, reducing factual errors by up to 48% in large language models. The approach combines inference-time detection with preference learning for model finetuning, demonstrating significant improvements in factual accuracy while maintaining summary quality for healthcare applications.
🧠 Llama
AI · Bullish · arXiv – CS AI · May 28 · 7/10
🧠StoryMI introduces a multi-agent LLM framework that generates therapeutic dialogue grounded in patient narratives and dynamically controlled MI strategies. The system benchmarks six LLMs across 6,000 simulated dialogues and demonstrates that situational context and macro-level strategy control improve clinical adherence to motivational interviewing standards.
AI · Bullish · arXiv – CS AI · May 28 · 7/10
🧠Researchers introduce Reverse Probing, a novel uncertainty quantification framework designed specifically for clinical LLMs that estimates token-level confidence directly from existing summaries rather than sampling new outputs. The method achieves significant performance improvements on clinical datasets while reducing computational costs, advancing the critical goal of making AI systems safer for healthcare applications.
AI · Bullish · arXiv – CS AI · Apr 20 · 7/10
🧠Researchers introduce AcuLa, a post-training framework that aligns audio encoders with medical language models to enhance clinical understanding of auscultation sounds. The method leverages LLMs to generate synthetic clinical reports from audio metadata and achieves significant performance improvements across 18 cardio-respiratory tasks, including boosting COVID-19 cough detection from 55% to 89% accuracy.
AI · Neutral · arXiv – CS AI · 2d ago · 5/10
🧠Researchers evaluated 26 open-source small language models for extracting clinical terms related to amyotrophic lateral sclerosis (ALS) from unstructured patient notes, finding that hybrid approaches combining rule-based methods with machine learning outperform either approach alone. The study demonstrates that modest-sized language models can handle specialized medical information extraction tasks without task-specific training, though traditional regex-based systems remain competitive for this application.
AI · Neutral · arXiv – CS AI · 2d ago · 5/10
🧠Researchers propose an explanation-guided framework for medical named entity recognition (NER) in Chinese atopic dermatitis clinical texts, using stability and boundary-aware constraints to improve model reliability and interpretability. The method combines perturbation-based analysis with adaptive fusion of local and global explanations, achieving performance gains across multiple NER models while enhancing explanation robustness for clinical decision support.
AI · Neutral · arXiv – CS AI · 6d ago · 6/10
🧠Researchers introduce PhysAssistBench, a new evaluation framework for testing large language models in real-world clinical settings where physicians, patients, and electronic health records interact simultaneously. The benchmark reveals that current leading LLMs struggle with coordinating medical knowledge, patient communication, and precise system interactions together, exposing a critical gap between isolated capability improvements and practical clinical assistance.
AI · Neutral · arXiv – CS AI · Jun 11 · 6/10
🧠Researchers introduce Lung-R1, an LLM specialized in pulmonary disease diagnosis that integrates a structured knowledge graph (LungKG) containing 59,038 nodes and 164,308 edges to enable patient-specific diagnostic reasoning from electronic medical records. The model achieves state-of-the-art performance on diagnostic tasks, demonstrating that grounding LLMs with domain-specific knowledge graphs significantly improves clinical reasoning over general knowledge recall.
AI · Neutral · arXiv – CS AI · Jun 10 · 6/10
🧠Researchers have developed an automated system for evaluating Korean toddler pronunciation using speaker diarization and self-supervised learning models, addressing a significant gap in speech assessment tools for this demographic. The system achieved balanced accuracies of 0.720 for consonants and 0.845 for vowels by routing predictions through specialized SSL models, offering potential clinical applications for detecting speech sound disorders affecting nearly half of Korean pediatric cases.
AI · Neutral · arXiv – CS AI · Jun 10 · 6/10
🧠Researchers introduce CRADLE-Dialogue, a clinician-annotated benchmark dataset with 600 dialogues for detecting mental health crises in real-time conversations. The study reveals that identifying when risk emerges in multi-turn dialogues is significantly harder than recognizing risk exists, with models achieving only 40-60% F1 scores, and releases a 32B-parameter model competitive with proprietary alternatives.
AI · Neutral · arXiv – CS AI · Jun 9 · 6/10
🧠Researchers evaluated LLaMA 3.1, an open-weight large language model, for extracting structured information from Dutch brain MRI reports. The model achieved high accuracy (80-96%) on visual rating scores and detection tasks, with few-shot prompting further improving performance on numerical variables, demonstrating practical viability for automated medical data extraction in radiology.
AI · Neutral · arXiv – CS AI · Jun 9 · 6/10
🧠Researchers developed a Cardiology Interface Terminology (CIT) system using machine learning to automatically highlight critical information in electronic health records, achieving 74.21% coverage with 98.2% completeness in identifying relevant clinical details.
AI · Neutral · arXiv – CS AI · Jun 9 · 6/10
🧠RadOT-Eval is a new AI framework that uses optimal transport algorithms to automatically evaluate radiology report generation by decomposing reports into structured clinical evidence units and detecting specific error types like omissions, hallucinations, and polarity reversals. The method achieves higher correlation with clinician-annotated errors than existing metrics and LLM-based evaluators, providing an auditable approach for quality assurance in high-stakes medical AI applications.
AI · Neutral · arXiv – CS AI · Jun 5 · 6/10
🧠Researchers evaluated how large language models performing structured data extraction from clinical notes respond to variations in prompts, model sizes, and data schemas. The study found that schema design—particularly the distinction between absent versus undocumented information—drives disagreement more than prompt phrasing, while model choice significantly impacts multi-class categorization tasks.
AI · Bullish · arXiv – CS AI · Jun 5 · 6/10
🧠A research study evaluates how large language models like Gemini 3.0 Flash can better answer patient health questions when provided with Personal Health Record (PHR) context. Testing 2,257 patient queries against de-identified PHRs showed significant improvements in helpfulness, safety, and accuracy, though the study identified specific gaps in LLM understanding of complex clinical data like temporal relationships.
🧠 Gemini
AI · Neutral · arXiv – CS AI · Jun 2 · 5/10
🧠The LinguIUTics team achieved 4th place in the PsyDefDetect 2026 shared task by fine-tuning Qwen3-8B to classify psychological defense mechanisms in clinical conversational text, reaching a macro F1-score of 0.3917 and substantially improving performance on rare classes through specialized techniques including minority-class augmentation and ensemble methods.
AI · Neutral · arXiv – CS AI · Jun 2 · 6/10
🧠Researchers demonstrate that fine-tuned large language models, particularly BERT, T5, and Llama-1B, achieve state-of-the-art performance in detecting Alzheimer's disease from speech transcripts across multiple datasets. The study reveals how these models encode disease-related linguistic signals through fine-tuning, advancing the potential for early AD diagnosis through text analysis.
🧠 Llama
AI · Neutral · arXiv – CS AI · May 27 · 6/10
🧠Researchers introduce EHR-ReasonCon, a benchmark dataset, and EHR-Inspector, an LLM-based framework, both designed to verify consistency between unstructured clinical notes and structured data in Electronic Health Records. The work addresses a critical gap in healthcare data quality by moving beyond simple value matching to capture clinical reasoning, temporal relationships, and event interpretations that reflect real-world documentation practices.
AI · Neutral · arXiv – CS AI · May 27 · 6/10
🧠Researchers developed a hybrid neural-symbolic pipeline for extracting clinical follow-up instructions from outpatient notes, pairing medical actions with future dates. The system significantly outperformed generative AI models (GPT-4o-mini and LLaMA-3) at linking actions to dates, achieving 99.7% F1 score on seen data versus 51-57% for baselines, demonstrating that symbolic reasoning outperforms pure language generation for structured clinical extraction tasks.
🧠 GPT-4
AI · Neutral · arXiv – CS AI · May 27 · 6/10
🧠EHRSummarizer presents a privacy-focused reference architecture for automatically summarizing fragmented electronic health records using FHIR standards and constrained AI summarization. The system addresses clinical workflow inefficiencies by normalizing health data and producing source-grounded summaries, though the research remains a prototype without clinical validation or demonstrated outcomes.
AI · Neutral · arXiv – CS AI · May 12 · 6/10
🧠Researchers developed a semi-structured extraction method for digitizing fragmented clinical reports using OCR and question-answering models, introducing 'key coverage' as a metric to measure data completeness. The approach achieves F1 scores above 0.83 on real-world hospital data from 20+ institutions using a lightweight BERT model, demonstrating that canonical key inventory completeness drives extraction performance.