#hallucination-detection News & Analysis

61 articles tagged with #hallucination-detection. AI-curated summaries with sentiment analysis and key takeaways from 50+ sources.

AI · Bearish · arXiv – CS AI · Jun 25 · 7/10

Perfect Detection, Failed Control: The Geometry of Knowing vs. Steering in Language Models

Researchers find that undesirable behaviors such as hallucination can be detected from language model internals with near-perfect accuracy, yet the neural directions that enable detection are nearly orthogonal (83 degrees apart) to those that control the behavior. This geometric dissociation between knowing and steering persists across multiple models and scales, challenging a core assumption of mechanistic interpretability: that detection should enable control.
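The 83-degree figure is an angle between two directions in activation space. As a minimal sketch, assuming the detection direction is something like a linear probe's weight vector and the steering direction is a vector added to activations (the summary does not say exactly which vectors the authors compared), the angle follows from their cosine similarity:

```python
import math


def angle_degrees(u, v):
    """Angle between two vectors in degrees (0 = aligned, 90 = orthogonal)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    cos_theta = dot / (norm_u * norm_v)
    # Clamp against floating-point drift before taking the arccosine.
    cos_theta = max(-1.0, min(1.0, cos_theta))
    return math.degrees(math.acos(cos_theta))


# Hypothetical 2-D toy directions; real ones would live in the model's hidden dimension.
probe_direction = [1.0, 0.0]
steering_direction = [math.cos(math.radians(83)), math.sin(math.radians(83))]
print(round(angle_degrees(probe_direction, steering_direction), 1))  # 83.0
```

At 83 degrees the cosine is only about 0.12, so a unit step along the steering direction moves the projection onto the probe direction by roughly 0.12.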

AI · Neutral · arXiv – CS AI · Jun 23 · 7/10

When Confidence Takes the Wrong Path: Diagnosing Retrieval-State Lock-In in RAG

Researchers identify 'retrieval-state lock-in,' a failure mode in retrieval-augmented generation (RAG) systems where multiple sampled answers agree despite being wrong because they condition on the same defective retrieval state. The study proposes decomposing confidence scores into three components—answer surface, evidence, and retrieval state—achieving 91.9% precision by requiring all three to agree, though this certifies only 7.7% of answers as low-risk.
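Requiring all three components to agree is a conjunction: any one weak signal vetoes certification, which raises precision while shrinking the share of answers certified. The sketch below assumes each component is already a score in [0, 1] and uses a single hypothetical threshold of 0.8; how the paper actually computes the components is not described here.

```python
from dataclasses import dataclass


@dataclass
class ConfidenceSignals:
    """Three per-answer confidence components, each assumed to lie in [0, 1]."""
    answer_surface: float   # e.g. agreement among sampled answers
    evidence: float         # e.g. support for the answer in retrieved passages
    retrieval_state: float  # e.g. confidence that the retrieval itself is sound


def certify_low_risk(s: ConfidenceSignals, threshold: float = 0.8) -> bool:
    """Certify only when all three components clear the threshold (logical AND)."""
    return min(s.answer_surface, s.evidence, s.retrieval_state) >= threshold


def certification_rate(batch) -> float:
    """Fraction of answers certified as low-risk (the coverage side of the trade-off)."""
    return sum(certify_low_risk(s) for s in batch) / len(batch)


# Lock-in case: samples agree (high surface score) but the shared retrieval is defective.
locked_in = ConfidenceSignals(answer_surface=0.97, evidence=0.85, retrieval_state=0.20)
healthy = ConfidenceSignals(answer_surface=0.92, evidence=0.88, retrieval_state=0.90)
print(certify_low_risk(locked_in), certify_low_risk(healthy))  # False True
print(certification_rate([locked_in, healthy]))  # 0.5
```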

AI · Bullish · arXiv – CS AI · Jun 23 · 7/10

Peeking Inside LLMs: Leveraging Internal Artifacts of LLMs for Enhancing Reliability in Legal Classification

Researchers demonstrate that internal computational artifacts of large language models can be used to reliably detect when a model produces incorrect outputs in legal classification tasks. Using these internal signals, downstream classifiers can flag hallucinated or erroneous predictions, potentially improving the reliability of LLM-based legal systems in high-stakes applications such as bail decisions and statute-violation prediction.

AI · Neutral · arXiv – CS AI · Jun 23 · 7/10

GroundEval: A Deterministic Replacement for LLM-as-Judge in Stateful Agent Evaluation

GroundEval introduces a deterministic framework for evaluating AI agents by auditing their evidence retrieval and reasoning paths rather than relying on LLM judges. The tool detected a critical failure case where frontier LLM judges scored an agent response above 0.85, but the actual trace revealed the agent never retrieved the artifact it cited, yielding a GroundEval score of 0.000.
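The failure case above can be caught without any judge model by comparing what an answer cites against what the trace shows was retrieved. The function below is a minimal sketch of that idea, not GroundEval's actual scoring rule; the artifact IDs and the convention for answers with no citations are hypothetical.

```python
def grounding_score(cited_ids, retrieved_ids):
    """
    Fraction of cited artifacts that actually appear in the agent's retrieval trace.
    Deterministic: the same trace always yields the same score, with no judge model.
    """
    cited = set(cited_ids)
    if not cited:
        # Hypothetical convention: an answer with no citations has nothing to ground.
        return 1.0
    retrieved = set(retrieved_ids)
    return len(cited & retrieved) / len(cited)


# The answer cites doc-42, but the trace shows only doc-7 and doc-9 were retrieved.
trace = ["doc-7", "doc-9"]
print(f"{grounding_score(['doc-42'], trace):.3f}")  # 0.000
```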

AI · Bullish · arXiv – CS AI · Jun 19 · 7/10

Detecting Hallucinations for Large Language Model-based Knowledge Graph Reasoning

Researchers introduce LUCID, a novel hallucination detection method for large language models used in knowledge graph reasoning tasks. By combining LLM attention scores, knowledge graph semantics, and structural information through graph neural networks, LUCID achieves state-of-the-art performance across nine datasets, addressing a critical reliability gap in AI-driven knowledge systems.

AI · Bullish · arXiv – CS AI · Jun 10 · 7/10

Integrating Local and Global Entropy for Uncertainty Quantification in LLMs

Researchers propose Global-Local Uncertainty (GLU), a new method for quantifying uncertainty in large language models by combining hidden-state geometric entropy with token-level signals. The approach successfully identifies confident-but-wrong predictions that existing token-only methods miss, offering improved reliability assessment across multiple model families.
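Token-level methods typically score uncertainty from the entropy of each next-token distribution, which stays low when a model is confidently wrong. The sketch below computes that local signal and blends it with a global score supplied from elsewhere; the convex combination and its weight are hypothetical stand-ins, since the summary gives neither GLU's formula nor how its hidden-state geometric entropy is computed.

```python
import math


def token_entropy(probs):
    """Shannon entropy (in nats) of one next-token probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def mean_token_entropy(step_distributions):
    """Local signal: average entropy over the generated tokens."""
    return sum(token_entropy(p) for p in step_distributions) / len(step_distributions)


def combined_uncertainty(local, global_score, weight=0.5):
    """Hypothetical convex blend of a token-level (local) and hidden-state (global) score."""
    return weight * local + (1.0 - weight) * global_score


# A confidently generated answer: every step is sharply peaked, so local entropy is small.
steps = [[0.98, 0.01, 0.01], [0.99, 0.01]]
local = mean_token_entropy(steps)
# Assume a separate hidden-state measure flags this answer as geometrically unusual.
print(local < 0.1, combined_uncertainty(local, global_score=1.2) > local)
```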

AI · Bullish · arXiv – CS AI · Jun 9 · 7/10

CARE: A Conformal Safety Layer for Medical Summarization

CARE introduces a conformal safety layer that detects hallucinations and omissions in LLM-generated medical summaries without retraining. The system provides formal, distribution-free guarantees for controlling safety risks while reducing clinician review burden by up to 5x compared to alternative methods.
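Distribution-free guarantees of this kind usually come from conformal calibration. As one generic illustration, not CARE's published procedure, a split-conformal threshold can be calibrated on detector scores of summaries known to contain errors, so that a new erroneous summary (assumed exchangeable with the calibration set) falls below the flagging threshold with probability at most alpha. The scores below are hypothetical.

```python
import math


def conformal_flag_threshold(unsafe_scores, alpha):
    """
    Split-conformal threshold for flagging. If a new unsafe summary is exchangeable
    with the calibration summaries, P(score < threshold) <= alpha, i.e. it is missed
    with probability at most alpha. Summaries scoring >= threshold go to review.
    """
    n = len(unsafe_scores)
    k = math.floor(alpha * (n + 1))
    if k == 0:
        # Too few calibration points for this alpha: flag everything.
        return float("-inf")
    return sorted(unsafe_scores)[k - 1]  # k-th smallest calibration score


# Hypothetical detector scores for 19 calibration summaries known to contain errors.
calibration = [0.31, 0.42, 0.47, 0.55, 0.58, 0.61, 0.63, 0.66, 0.70, 0.72,
               0.75, 0.78, 0.80, 0.83, 0.86, 0.88, 0.91, 0.94, 0.97]
tau = conformal_flag_threshold(calibration, alpha=0.10)
print(tau)  # 0.42
```

The guarantee is marginal and bounds only missed errors; how many correct summaries also get flagged is what determines the review burden.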

AI · Bullish · arXiv – CS AI · Jun 8 · 7/10

Whisper Hallucination Detection and Mitigation via Hidden Representation Steering and Sparse AutoEncoders

Researchers show that hallucinations in Whisper, OpenAI's widely used speech recognition model, can be detected and mitigated using Sparse AutoEncoders and activation-space steering. Here, hallucinations are fabricated but coherent transcriptions produced from non-speech audio. The approach reduces hallucination rates from 72-87% to 14-27% across model sizes with minimal performance degradation on actual speech.
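Activation-space steering edits a hidden state along a chosen direction at inference time. A minimal sketch, assuming the direction is something like a Sparse AutoEncoder feature associated with hallucination (the summary does not specify how the authors choose or apply it), removes part of the hidden state's component along that direction:

```python
import math


def normalize(v):
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def steer_away(hidden, direction, strength=1.0):
    """
    Subtract `strength` times the projection of `hidden` onto `direction`.
    strength=1.0 removes the component entirely; smaller values only dampen it.
    """
    d = normalize(direction)
    projection = sum(h * x for h, x in zip(hidden, d))
    return [h - strength * projection * x for h, x in zip(hidden, d)]


# Hypothetical 3-D hidden state and "hallucination" direction.
h = [2.0, 1.0, -0.5]
hallucination_dir = [1.0, 0.0, 0.0]
print(steer_away(h, hallucination_dir))  # [0.0, 1.0, -0.5]
```

A strength below 1.0 is one way to trade hallucination suppression against degradation on real speech.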

AI · Bullish · arXiv – CS AI · Jun 8 · 7/10

OpenHalDet: A Unified Benchmark for Hallucination Detection across Diverse Generation Scenarios

Researchers introduce OpenHalDet, an open-source benchmark framework that standardizes hallucination detection evaluation across diverse LLM scenarios. The unified framework addresses reproducibility challenges by providing consistent evaluation pipelines and supporting multiple detector types (black-box, gray-box, white-box), enabling more reliable comparison of hallucination detection methods.

AI · Neutral · arXiv – CS AI · Jun 4 · 7/10

Cascading Hallucination in Agentic RAG: The CHARM Framework for Detection and Mitigation

Researchers introduce CHARM, a framework for detecting and mitigating cascading hallucinations in multi-step AI reasoning pipelines where errors compound across stages. The system achieves 89.4% detection accuracy with minimal false positives, addressing a critical vulnerability in agentic RAG systems that existing methods fail to catch.

AI · Bullish · arXiv – CS AI · Jun 3 · 7/10

TriEval: A Resource-Efficient Pipeline for LLM Bias, Toxicity, and Truthfulness Assessment

TriEval introduces an open-source pipeline for evaluating large language models across bias, toxicity, and truthfulness simultaneously while requiring minimal computational resources. The tool runs on standard laptops without GPU clusters, making rigorous LLM safety testing accessible to researchers with limited budgets, and reveals significant performance differences between open-source and closed-source models.

Tags: Claude · Llama

AI · Bullish · arXiv – CS AI · Jun 2 · 7/10

TriLens: Per-Layer Logit-Lens Entropy for White-Box Hallucination Detection

TriLens is a white-box method that detects hallucinations in language models by tracking how entropy changes across internal layers. Rather than examining only the final output, it uses logit-lens analysis to read uncertainty signals from the multi-head attention, feed-forward network, and residual stream of each layer, yielding a compact 3L-dimensional trajectory (for a model with L layers) that shows how the model's confidence settles during inference.
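The per-layer feature construction can be sketched as follows, assuming the logit lens has already been applied (each component's output mapped to vocabulary logits through the model's unembedding) so that every layer supplies three logit vectors. The input format and function names are hypothetical, and the downstream classifier that consumes the 3L-dimensional vector is omitted.

```python
import math


def softmax(logits):
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]


def entropy(probs):
    return -sum(p * math.log(p) for p in probs if p > 0)


def entropy_trajectory(layer_logits):
    """
    layer_logits: list of length L; each item maps 'attn', 'ffn' and 'resid' to the
    logit-lens vocabulary logits for that component. Returns 3 * L entropies.
    """
    features = []
    for layer in layer_logits:
        for component in ("attn", "ffn", "resid"):
            features.append(entropy(softmax(layer[component])))
    return features


# Toy example: 2 layers, vocabulary of 4 tokens, confidence sharpening with depth.
layers = [
    {"attn": [0.0, 0.0, 0.0, 0.0], "ffn": [1.0, 0.0, 0.0, 0.0], "resid": [2.0, 0.0, 0.0, 0.0]},
    {"attn": [3.0, 0.0, 0.0, 0.0], "ffn": [5.0, 0.0, 0.0, 0.0], "resid": [8.0, 0.0, 0.0, 0.0]},
]
trajectory = entropy_trajectory(layers)
print(len(trajectory))  # 6
```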

AI · Bullish · arXiv – CS AI · Jun 2 · 7/10

Detect Before You Leap: Mirage Detection in Vision-Language Models

Researchers have developed TC-LIA, a model-agnostic detection method that identifies when Vision-Language Models produce confident but visually ungrounded answers—a failure mode called 'mirage.' The technique achieves 94.6-94.7% accuracy in detecting these hallucinations across multiple VLM architectures, reducing mirage rates from 21.7-66.6% to below 3%, with significant implications for medical and document-based AI systems where false confidence poses safety risks.

AI · Bullish · arXiv – CS AI · Jun 1 · 7/10

From Out-of-Distribution Detection to Hallucination Detection: A Geometric View

Researchers propose treating hallucination detection in large language models as an out-of-distribution (OOD) detection problem, leveraging computer vision techniques to create training-free detectors. This geometric approach shows strong performance on reasoning tasks where existing methods struggle, offering a scalable pathway to improve LLM safety and reliability.
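One widely used training-free OOD score from computer vision is the distance from a normalized feature vector to its nearest neighbors in a reference set of in-distribution features. The sketch below applies that idea to hypothetical hidden-state embeddings of trusted generations; it illustrates the general OOD framing, not the specific geometric detector the authors propose.

```python
import math


def l2_normalize(v):
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def knn_ood_score(query, reference, k=1):
    """
    Distance from the normalized query to its k-th nearest normalized reference
    vector. Larger scores mean the query lies farther from the reference set.
    """
    q = l2_normalize(query)
    distances = sorted(math.dist(q, l2_normalize(r)) for r in reference)
    return distances[k - 1]


# Hypothetical embeddings of answers already known to be grounded.
reference = [[1.0, 0.1, 0.0], [0.9, 0.2, 0.1], [1.0, 0.0, 0.2]]
typical = [0.95, 0.1, 0.1]
unusual = [0.0, 0.2, 1.0]
print(knn_ood_score(typical, reference) < knn_ood_score(unusual, reference))  # True
```

In practice an answer would be flagged when its score exceeds a threshold chosen on held-out data.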

AI · Bullish · arXiv – CS AI · Jun 1 · 7/10

LLM-FACETS: A Privacy-Preserving Framework for Evaluating LLM Transparency and Accountability

Researchers introduce LLM-FACETS, an open-source framework designed to make LLM auditing accessible to non-technical practitioners while preserving data privacy. The system addresses regulatory compliance needs outlined in the EU AI Act and NIST frameworks by providing browser-based evaluation tools that keep sensitive data on self-hosted servers rather than transmitting it to external services.

AI · Bullish · arXiv – CS AI · May 29 · 7/10

Hallucination Detection-Guided Preference Optimization for Clinical Summarization

Researchers introduce HDPO, a method that uses hallucination detectors to guide iterative refinement of AI-generated clinical summaries, reducing factual errors by up to 48% in large language models. The approach combines inference-time detection with preference learning for model finetuning, demonstrating significant improvements in factual accuracy while maintaining summary quality for healthcare applications.

Tags: Llama

AI · Bearish · arXiv – CS AI · May 28 · 7/10

Relevant Is Not Warranted: Evidence-Force Calibration for Cited RAG

Researchers identify a critical failure mode in Retrieval-Augmented Generation (RAG) evaluation called 'citation laundering,' where topically relevant sources are presented as evidence for claims they don't actually support. The team introduces FORCEBENCH, a diagnostic benchmark that tests whether AI evaluators can distinguish between evidence-calibrated claims and over-generalized ones, revealing that current evaluation methods fail to detect warrant mismatches in 24-47% of cases.

AI · Neutral · arXiv – CS AI · May 27 · 7/10

Beyond Final Answers: Auditing Trajectory-Level Hallucinations in Multi-Agent Industrial Workflows

Researchers introduce Trajel, a dataset and evaluation framework for detecting hallucinations in multi-step LLM agent workflows, revealing that existing benchmarks miss intermediate failures. The framework defines five hallucination types and shows that trajectory-level detection outperforms traditional post-hoc verification, highlighting critical gaps in current AI safety evaluation methodologies.

AI · Bullish · arXiv – CS AI · May 27 · 7/10

Neuro-Symbolic Verification of LLM Outputs for Data-Sensitive Domains (extended preprint)

Researchers present a hybrid neuro-symbolic architecture that combines formal logic with neural semantic analysis to verify LLM outputs in high-stakes domains like healthcare. The system achieves over 83% hallucination detection rates for structured data and 72% for semantic fabrications while reducing report creation time by 30%, demonstrating practical safeguards for deploying LLMs in data-sensitive applications.

AI · Bullish · arXiv – CS AI · May 27 · 7/10

ScientistOne: Towards Human-Level Autonomous Research via Chain-of-Evidence

ScientistOne introduces Chain-of-Evidence, a verifiability framework addressing critical failures in autonomous research systems where AI agents produce plausible-looking but unreliable outputs including fabricated citations, unverified scores, and misaligned methods. The system achieves zero hallucinated references and perfect score verification across five research tasks, significantly outperforming existing baseline systems that exhibit systematic failure rates up to 80%.

AI · Bearish · arXiv – CS AI · May 27 · 7/10

The Attribution Blind Spot: Detecting When Language Models Rely on Memory Rather Than Retrieved Context

Researchers identify a critical vulnerability in retrieval-augmented generation systems where language models produce faithful-looking outputs from memory rather than retrieved context, making it impossible to verify source attribution through output analysis alone. They propose Computational Reality Monitoring (CRM), a technique that detects internal representational differences to identify when models rely on pretraining data versus external evidence.

AI · Neutral · arXiv – CS AI · May 27 · 7/10

QUACK: Questioning, Understanding, and Auditing Communicated Knowledge in Multimodal Social Deduction Agents

Researchers introduce QUACK, an evaluation framework for auditing whether AI agents in social deduction games actually ground their language in perceived reality or hallucinate claims. Testing three frontier vision-language models reveals that even top performers hallucinate 15% of spatial claims and make accusations without evidence, exposing critical gaps in agent reasoning reliability.

AI · Neutral · arXiv – CS AI · May 12 · 7/10

Sanity Checks for Long-Form Hallucination Detection

Researchers introduce a controlled-invariance methodology to distinguish whether hallucination detection in large language models actually evaluates reasoning quality or merely exploits surface-level answer cues. Their lightweight TRACT model demonstrates that effective detection relies primarily on lexical trajectory features rather than complex learned representations, suggesting current detection methods conflate endpoint artifacts with genuine reasoning validation.

AI · Neutral · arXiv – CS AI · May 12 · 7/10

Delulu: A Verified Multi-Lingual Benchmark for Code Hallucination Detection in Fill-in-the-Middle Tasks

Microsoft researchers released Delulu, a benchmark dataset containing 1,951 code generation samples across 7 programming languages designed to test how well large language models detect hallucinations in Fill-in-the-Middle tasks. Testing 11 open-weight models revealed fundamental limitations, with even the strongest achieving only 84.5% accuracy, indicating that code hallucination remains a persistent challenge across all model families.

AI · Neutral · arXiv – CS AI · May 9 · 7/10

Attractor Geometry of Transformer Memory: From Conflict Arbitration to Confident Hallucination

Researchers have identified a geometric framework explaining how language models fail through two distinct mechanisms: conflict between parametric memory and working memory, and hallucination when the relevant fact was never learned. Both failures produce confident outputs despite being mechanistically different, but hidden-state geometry and a 'geometric margin' metric distinguish them more reliably than traditional entropy-based detection methods.

Page 1 of 3