#clinical-nlp News & Analysis

30 articles tagged with #clinical-nlp. AI-curated summaries with sentiment analysis and key takeaways from 50+ sources.

AI · Neutral · arXiv – CS AI · 2d ago · 7/10

DrugBench: Evaluating AI Control Protocols for Medication Harm Mitigation

Researchers introduce DrugBench, a benchmark for evaluating AI safety protocols in medical LLM applications, combining 3,671 medical conversations with FDA drug data to test systems against medication-related harms. The study reveals that existing AI control mechanisms can be circumvented and proposes severity-based monitoring to better account for the potential consequences of unsafe outputs in clinical contexts.

AI · Bearish · arXiv – CS AI · 6d ago · 7/10

Before the Labels: How Dataset Construction Shapes Suicidality Detection in Clinical Text

Researchers demonstrate that clinical NLP datasets for suicidality detection, particularly the ScAN dataset built on MIMIC-III notes, embed specific operational choices, and that their labels reflect those choices rather than objective ground truth. Dataset design decisions, including single annotators, ICD-based cohort selection, and hospital-stay aggregation, shape what suicidality means in algorithmic systems, highlighting critical gaps between documented clinical judgments and actual suicidal intent.

AI · Bullish · arXiv – CS AI · Jun 9 · 7/10

CARE: A Conformal Safety Layer for Medical Summarization

CARE introduces a conformal safety layer that detects hallucinations and omissions in LLM-generated medical summaries without retraining. The system provides formal, distribution-free guarantees for controlling safety risks while reducing clinician review burden by up to 5x compared to alternative methods.

AI · Bullish · arXiv – CS AI · Jun 5 · 7/10

Improving Heart-Focused Medical Question Answering in LLMs via Variance-Aware Rubric Rewards with GRPO

Researchers demonstrate that Group Relative Policy Optimization (GRPO) combined with a novel Variance-Aware Reward Framework significantly improves smaller LLMs' performance on medical question answering, particularly for heart-related queries. The approach achieves a 38% accuracy improvement on a held-out test set while remaining competitive with much larger models, offering a practical path toward efficient, deployable medical AI systems.
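
For readers unfamiliar with GRPO, the core idea is that each sampled answer is scored relative to the other answers sampled for the same question. The sketch below shows only that group-relative normalization step. It is not the paper's Variance-Aware Reward Framework, whose details are not given in this summary; the function name, the rubric scores, and the use of the population standard deviation are all assumptions made for illustration.

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against the mean and spread of its own group.

    `rewards` holds one score per answer sampled for the same question.
    `eps` avoids division by zero when every answer gets the same score.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; some implementations use sample std
    return [(r - mu) / (sigma + eps) for r in rewards]

# Hypothetical rubric scores for four answers to one cardiology question.
advantages = group_relative_advantages([0.2, 0.5, 0.5, 0.8])
```

Answers scoring above the group mean receive a positive advantage and those below receive a negative one. When all answers in a group score the same, every advantage is zero and the group carries no learning signal.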

AI · Bullish · arXiv – CS AI · May 29 · 7/10

Hallucination Detection-Guided Preference Optimization for Clinical Summarization

Researchers introduce HDPO, a method that uses hallucination detectors to guide iterative refinement of AI-generated clinical summaries, reducing factual errors by up to 48% in large language models. The approach combines inference-time detection with preference learning for model finetuning, demonstrating significant improvements in factual accuracy while maintaining summary quality for healthcare applications.

🧠 Llama

AI · Bullish · arXiv – CS AI · May 28 · 7/10

StoryMI: Steerable Multi-Agent Therapeutic Dialogue Generation

StoryMI introduces a multi-agent LLM framework that generates therapeutic dialogue grounded in patient narratives and dynamically controlled motivational interviewing (MI) strategies. The system benchmarks six LLMs across 6,000 simulated dialogues and demonstrates that situational context and macro-level strategy control improve clinical adherence to MI standards.

AI · Bullish · arXiv – CS AI · May 28 · 7/10

Reverse Probing: Supervised Token-level Uncertainty Quantification for Large Language Models in Clinical Text

Researchers introduce Reverse Probing, a novel uncertainty quantification framework designed specifically for clinical LLMs that estimates token-level confidence directly from existing summaries rather than sampling new outputs. The method achieves significant performance improvements on clinical datasets while reducing computational costs, advancing the critical goal of making AI systems safer for healthcare applications.

AI · Bullish · arXiv – CS AI · Apr 20 · 7/10

Language Models as Semantic Teachers: Post-Training Alignment for Medical Audio Understanding

Researchers introduce AcuLa, a post-training framework that aligns audio encoders with medical language models to enhance clinical understanding of auscultation sounds. The method leverages LLMs to generate synthetic clinical reports from audio metadata and achieves significant performance improvements across 18 cardio-respiratory tasks, including boosting COVID-19 cough detection from 55% to 89% accuracy.

AI · Neutral · arXiv – CS AI · 2d ago · 5/10

Clinical Term Extraction using Open-Source Small Language Models

Researchers evaluated 26 open-source small language models for extracting clinical terms related to amyotrophic lateral sclerosis (ALS) from unstructured patient notes, finding that hybrid approaches combining rule-based methods with machine learning outperform either approach alone. The study demonstrates that modest-sized language models can handle specialized medical information extraction tasks without task-specific training, though traditional regex-based systems remain competitive for this application.

AI · Neutral · arXiv – CS AI · 2d ago · 5/10

Explanation-Guided Medical Named Entity Recognition with Stability and Boundary Awareness for Atopic Dermatitis

Researchers propose an explanation-guided framework for medical named entity recognition (NER) in Chinese atopic dermatitis clinical texts, using stability and boundary-aware constraints to improve model reliability and interpretability. The method combines perturbation-based analysis with adaptive fusion of local and global explanations, achieving performance gains across multiple NER models while enhancing explanation robustness for clinical decision support.

AI · Neutral · arXiv – CS AI · 6d ago · 6/10

Are LLMs Ready to Assist Physicians? PhysAssistBench for Interactive Doctor-Patient-EHR Assistance

Researchers introduce PhysAssistBench, a new evaluation framework for testing large language models in real-world clinical settings where physicians, patients, and electronic health records interact simultaneously. The benchmark reveals that current leading LLMs struggle with coordinating medical knowledge, patient communication, and precise system interactions together, exposing a critical gap between isolated capability improvements and practical clinical assistance.

AI · Neutral · arXiv – CS AI · Jun 11 · 6/10

Lung-R1: A Knowledge Graph-Guided LLM for Pulmonary Diagnostic Reasoning

Researchers introduce Lung-R1, an LLM specialized in pulmonary disease diagnosis that integrates a structured knowledge graph (LungKG) containing 59,038 nodes and 164,308 edges to enable patient-specific diagnostic reasoning from electronic medical records. The model achieves state-of-the-art performance on diagnostic tasks, demonstrating that grounding LLMs with domain-specific knowledge graphs significantly improves clinical reasoning over general knowledge recall.

AI · Neutral · arXiv – CS AI · Jun 10 · 6/10

Automated Pronunciation Evaluation for Korean Toddler Speech using Speech Diarization and Self-Supervised Learning

Researchers have developed an automated system for evaluating Korean toddler pronunciation using speaker diarization and self-supervised learning models, addressing a significant gap in speech assessment tools for this demographic. The system achieved balanced accuracies of 0.720 for consonants and 0.845 for vowels by routing predictions through specialized SSL models, offering potential clinical applications for detecting speech sound disorders affecting nearly half of Korean pediatric cases.
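
Balanced accuracy, the metric quoted above, is the mean of per-class recall, so a rare class counts as much as a common one. The sketch below illustrates the standard definition only; the labels and predictions are made up and do not come from the paper.

```python
from collections import defaultdict

def balanced_accuracy(y_true, y_pred):
    """Return the mean of per-class recall over the classes in y_true."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for truth, pred in zip(y_true, y_pred):
        total[truth] += 1
        if truth == pred:
            correct[truth] += 1
    return sum(correct[c] / total[c] for c in total) / len(total)

# Made-up example: 8 correct pronunciations and 2 mispronunciations.
y_true = ["ok"] * 8 + ["error"] * 2
y_pred = ["ok"] * 8 + ["ok", "error"]  # one of the two errors is missed
score = balanced_accuracy(y_true, y_pred)
```

In this toy case plain accuracy would be 0.9, while balanced accuracy is 0.75, because recall on the rare "error" class is only 0.5.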

AI · Neutral · arXiv – CS AI · Jun 10 · 6/10

Expert-Level Crisis Detection in Mental Health Conversations

Researchers introduce CRADLE-Dialogue, a clinician-annotated benchmark dataset with 600 dialogues for detecting mental health crises in real-time conversations. The study reveals that identifying when risk emerges in multi-turn dialogues is significantly harder than recognizing that risk exists, with models achieving only 40-60% F1 scores, and releases a 32B-parameter model competitive with proprietary alternatives.

AI · Neutral · arXiv – CS AI · Jun 9 · 6/10

Automatic Extraction of Structured Information from Brain MRI Reports Using an Open-Weight Large Language Model

Researchers evaluated LLaMA 3.1, an open-weight large language model, for extracting structured information from Dutch brain MRI reports. The model achieved high accuracy (80-96%) on visual rating scores and detection tasks, with few-shot prompting further improving performance on numerical variables, demonstrating practical viability for automated medical data extraction in radiology.

AI · Neutral · arXiv – CS AI · Jun 9 · 6/10

Curation of a Cardiology Interface Terminology for Highlighting Electronic Health Records using Machine Learning

Researchers developed a Cardiology Interface Terminology (CIT) system using machine learning to automatically highlight critical information in electronic health records, achieving 74.21% coverage with 98.2% completeness in identifying relevant clinical details.

AI · Neutral · arXiv – CS AI · Jun 9 · 6/10

RadOT-Eval: Auditable Structured-Evidence Transport for Radiology Report Evaluation

RadOT-Eval is a new AI framework that uses optimal transport algorithms to automatically evaluate radiology report generation by decomposing reports into structured clinical evidence units and detecting specific error types like omissions, hallucinations, and polarity reversals. The method achieves higher correlation with clinician-annotated errors than existing metrics and LLM-based evaluators, providing an auditable approach for quality assurance in high-stakes medical AI applications.

AI · Neutral · arXiv – CS AI · Jun 5 · 6/10

Measuring the sensitivity of LLM-based structured extraction to prompt, model, and schema choices in clinical discharge summaries

Researchers evaluated how large language models performing structured data extraction from clinical notes respond to variations in prompts, model sizes, and data schemas. The study found that schema design, particularly the distinction between absent and undocumented information, drives disagreement more than prompt phrasing, while model choice significantly impacts multi-class categorization tasks.

AI · Bullish · arXiv – CS AI · Jun 5 · 6/10

Evaluating the Utility of Personal Health Records in Personalized Health AI

A research study evaluates how large language models like Gemini 3.0 Flash can better answer patient health questions when provided with Personal Health Record (PHR) context. Testing 2,257 patient queries against de-identified PHRs showed significant improvements in helpfulness, safety, and accuracy, though the study identified specific gaps in LLM understanding of complex clinical data like temporal relationships.

🧠 Gemini

AI · Neutral · arXiv – CS AI · Jun 2 · 5/10

LinguIUTics at PsyDefDetect: Iterative Imbalance-Aware Fine-tuning of Qwen3-8B for Psychological Defense Mechanism Classification

The LinguIUTics team achieved 4th place in the PsyDefDetect 2026 shared task by fine-tuning Qwen3-8B to classify psychological defense mechanisms in clinical conversational text, reaching a macro F1-score of 0.3917 and substantially improving performance on rare classes through specialized techniques including minority-class augmentation and ensemble methods.
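
Macro F1, the score reported above, is the unweighted mean of per-class F1, which is why gains on rare classes move it so much. The sketch below shows the standard computation on made-up labels. The class names are placeholders and are not taken from the shared task.

```python
def macro_f1(y_true, y_pred):
    """Return the unweighted mean of per-class F1 scores."""
    labels = sorted(set(y_true) | set(y_pred))
    pairs = list(zip(y_true, y_pred))
    f1_scores = []
    for label in labels:
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        denom = 2 * tp + fp + fn
        # A class with no true or predicted instances contributes 0.0 here.
        f1_scores.append(2 * tp / denom if denom else 0.0)
    return sum(f1_scores) / len(f1_scores)

# Made-up example: a classifier that always predicts the majority class.
y_true = ["common"] * 9 + ["rare"]
y_pred = ["common"] * 10
score = macro_f1(y_true, y_pred)
```

Here plain accuracy is 0.9, but macro F1 is roughly 0.47, because the "rare" class is never predicted and its F1 of zero pulls the average down.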

AI · Neutral · arXiv – CS AI · Jun 2 · 6/10

What Do LLMs Know About Alzheimer's Disease? Multi-loss Fine-Tuning and Probing for AD Detection

Researchers demonstrate that fine-tuned large language models, particularly BERT, T5, and Llama-1B, achieve state-of-the-art performance in detecting Alzheimer's disease from speech transcripts across multiple datasets. The study reveals how these models encode disease-related linguistic signals through fine-tuning, advancing the potential for early AD diagnosis through text analysis.

🧠 Llama

AI · Neutral · arXiv – CS AI · May 27 · 6/10

Towards Error-Free EHRs: Reasoning-Intensive Consistency Verification Between Clinical Notes and Structured Tables in Electronic Health Records

Researchers introduce EHR-ReasonCon, a benchmark dataset, and EHR-Inspector, an LLM-based framework, both designed to verify consistency between unstructured clinical notes and structured data in Electronic Health Records. The work addresses a critical gap in healthcare data quality by moving beyond simple value matching to capture clinical reasoning, temporal relationships, and event interpretations that reflect real-world documentation practices.

AI · Neutral · arXiv – CS AI · May 27 · 6/10

Reliable Extraction of Clinical Follow-Up Instructions: A Hybrid Neural-Symbolic Pipeline

Researchers developed a hybrid neural-symbolic pipeline for extracting clinical follow-up instructions from outpatient notes, pairing medical actions with future dates. The system significantly outperformed generative AI models (GPT-4o-mini and LLaMA-3) at linking actions to dates, achieving 99.7% F1 score on seen data versus 51-57% for baselines, demonstrating that symbolic reasoning outperforms pure language generation for structured clinical extraction tasks.

🧠 GPT-4

AI · Neutral · arXiv – CS AI · May 27 · 6/10

EHRSummarizer: A Privacy-Aware, FHIR-Native Reference Architecture for Source-Grounded EHR Summarization

EHRSummarizer presents a privacy-focused reference architecture for automatically summarizing fragmented electronic health records using FHIR standards and constrained AI summarization. The system addresses clinical workflow inefficiencies by normalizing health data and producing source-grounded summaries, though the research remains a prototype without clinical validation or demonstrated outcomes.

AI · Neutral · arXiv – CS AI · May 12 · 6/10

Key Coverage Matters: Semi-Structured Extraction of OCR Clinical Reports

Researchers developed a semi-structured extraction method for digitizing fragmented clinical reports using OCR and question-answering models, introducing 'key coverage' as a metric to measure data completeness. The approach achieves F1 scores above 0.83 on real-world hospital data from 20+ institutions using a lightweight BERT model, demonstrating that canonical key inventory completeness drives extraction performance.
