AI · Bullish · arXiv – CS AI · 14h ago · 7/10
🧠TriEval introduces an open-source pipeline for evaluating large language models across bias, toxicity, and truthfulness simultaneously while requiring minimal computational resources. The tool runs on standard laptops without GPU clusters, making rigorous LLM safety testing accessible to researchers with limited budgets, and reveals significant performance differences between open-source and closed-source models.
🧠 Claude · 🧠 Llama
AI · Bullish · arXiv – CS AI · 1d ago · 7/10
🧠TriLens is a novel white-box detection method that identifies hallucinations in language models by tracking entropy changes across internal computational layers. Rather than examining only final outputs, the technique monitors uncertainty signals from multi-head attention, feed-forward networks, and residual streams using logit lens analysis, creating a compact 3L-dimensional trajectory that reveals how model confidence settles during inference.
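The 3L-dimensional trajectory mentioned above can be illustrated with a minimal sketch. It assumes that, for each of the L layers, a logit-lens projection has already produced one vocabulary-sized logit vector for each of the three components (attention, feed-forward, residual stream). The function names and toy inputs here are hypothetical and are not taken from the TriLens paper; the code only shows how three entropies per layer concatenate into a vector of length 3L.

```python
import math
from typing import List, Sequence, Tuple


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a vector of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs: Sequence[float]) -> float:
    """Shannon entropy in nats; zero-probability terms contribute nothing."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


LayerLogits = Tuple[Sequence[float], Sequence[float], Sequence[float]]


def entropy_trajectory(layers: Sequence[LayerLogits]) -> List[float]:
    """Build a 3L-dimensional vector of entropies.

    Each element of `layers` is a hypothetical triple of logit-lens
    outputs: (attention, feed-forward, residual stream).
    """
    trajectory: List[float] = []
    for attn_logits, ffn_logits, resid_logits in layers:
        for logits in (attn_logits, ffn_logits, resid_logits):
            trajectory.append(entropy(softmax(logits)))
    return trajectory


# Toy input (made up): 2 layers over a 4-token vocabulary.
toy_layers = [
    ([0.0, 0.0, 0.0, 0.0], [0.0, 0.0, 0.0, 0.0], [0.0, 0.0, 0.0, 0.0]),
    ([5.0, 0.0, 0.0, 0.0], [8.0, 0.0, 0.0, 0.0], [10.0, 0.0, 0.0, 0.0]),
]
trajectory = entropy_trajectory(toy_layers)
```

In this toy input the first layer is maximally uncertain (uniform logits) and the second is sharply peaked, so the entropies drop along the trajectory. A detector along the lines described would then be trained on such vectors; that step is not shown here.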
AI · Bullish · arXiv – CS AI · 1d ago · 7/10
🧠Researchers have developed TC-LIA, a model-agnostic detection method that identifies when Vision-Language Models produce confident but visually ungrounded answers—a failure mode called 'mirage.' The technique achieves 94.6-94.7% accuracy in detecting these hallucinations across multiple VLM architectures, reducing mirage rates from 21.7-66.6% to below 3%, with significant implications for medical and document-based AI systems where false confidence poses safety risks.
AI · Bullish · arXiv – CS AI · 2d ago · 7/10
🧠Researchers introduce LLM-FACETS, an open-source framework designed to make LLM auditing accessible to non-technical practitioners while preserving data privacy. The system addresses regulatory compliance needs outlined in the EU AI Act and NIST frameworks by providing browser-based evaluation tools that keep sensitive data on self-hosted servers rather than transmitting it to external services.
AI · Bullish · arXiv – CS AI · 2d ago · 7/10
🧠Researchers propose treating hallucination detection in large language models as an out-of-distribution (OOD) detection problem, leveraging computer vision techniques to create training-free detectors. This geometric approach shows strong performance on reasoning tasks where existing methods struggle, offering a scalable pathway to improve LLM safety and reliability.
AI · Bullish · arXiv – CS AI · 5d ago · 7/10
🧠Researchers introduce HDPO, a method that uses hallucination detectors to guide iterative refinement of AI-generated clinical summaries, reducing factual errors by up to 48% in large language models. The approach combines inference-time detection with preference learning for model finetuning, demonstrating significant improvements in factual accuracy while maintaining summary quality for healthcare applications.
🧠 Llama
AI · Bearish · arXiv – CS AI · 6d ago · 7/10
🧠Researchers identify a critical failure mode in Retrieval-Augmented Generation (RAG) evaluation called 'citation laundering,' where topically relevant sources are presented as evidence for claims they don't actually support. The team introduces FORCEBENCH, a diagnostic benchmark that tests whether AI evaluators can distinguish between evidence-calibrated claims and over-generalized ones, revealing that current evaluation methods fail to detect warrant mismatches in 24-47% of cases.
AI · Neutral · arXiv – CS AI · May 27 · 7/10
🧠Researchers introduce Trajel, a dataset and evaluation framework for detecting hallucinations in multi-step LLM agent workflows, revealing that existing benchmarks miss intermediate failures. The framework defines five hallucination types and shows that trajectory-level detection outperforms traditional post-hoc verification, highlighting critical gaps in current AI safety evaluation methodologies.
AI · Bullish · arXiv – CS AI · May 27 · 7/10
🧠ScientistOne introduces Chain-of-Evidence, a verifiability framework addressing critical failures in autonomous research systems where AI agents produce plausible-looking but unreliable outputs including fabricated citations, unverified scores, and misaligned methods. The system achieves zero hallucinated references and perfect score verification across five research tasks, significantly outperforming existing baseline systems that exhibit systematic failure rates up to 80%.
AI · Bullish · arXiv – CS AI · May 27 · 7/10
🧠Researchers present a hybrid neuro-symbolic architecture that combines formal logic with neural semantic analysis to verify LLM outputs in high-stakes domains like healthcare. The system achieves over 83% hallucination detection rates for structured data and 72% for semantic fabrications while reducing report creation time by 30%, demonstrating practical safeguards for deploying LLMs in data-sensitive applications.
AI · Neutral · arXiv – CS AI · May 27 · 7/10
🧠Researchers introduce QUACK, an evaluation framework for auditing whether AI agents in social deduction games actually ground their language in perceived reality or hallucinate claims. Testing three frontier vision-language models reveals that even top performers hallucinate 15% of spatial claims and make accusations without evidence, exposing critical gaps in agent reasoning reliability.
AI · Bearish · arXiv – CS AI · May 27 · 7/10
🧠Researchers identify a critical vulnerability in retrieval-augmented generation systems where language models produce faithful-looking outputs from memory rather than retrieved context, making it impossible to verify source attribution through output analysis alone. They propose Computational Reality Monitoring (CRM), a technique that detects internal representational differences to identify when models rely on pretraining data versus external evidence.
AI · Neutral · arXiv – CS AI · May 12 · 7/10
🧠Microsoft researchers released Delulu, a benchmark dataset containing 1,951 code generation samples across 7 programming languages designed to test how well large language models detect hallucinations in Fill-in-the-Middle tasks. Testing 11 open-weight models revealed fundamental limitations, with even the strongest achieving only 84.5% accuracy, indicating that code hallucination remains a persistent challenge across all model families.
AI · Neutral · arXiv – CS AI · May 12 · 7/10
🧠Researchers introduce a controlled-invariance methodology to distinguish whether hallucination detection in large language models actually evaluates reasoning quality or merely exploits surface-level answer cues. Their lightweight TRACT model demonstrates that effective detection relies primarily on lexical trajectory features rather than complex learned representations, suggesting current detection methods conflate endpoint artifacts with genuine reasoning validation.
AI · Neutral · arXiv – CS AI · May 9 · 7/10
🧠Researchers have identified a geometric framework explaining how language models fail through two distinct mechanisms: parametric memory conflicting with working memory, and hallucination from absent learned facts. Both failures produce confident outputs despite being mechanistically different, but hidden-state geometry and 'geometric margin' metrics can distinguish them more reliably than traditional entropy-based detection methods.
AI · Bullish · arXiv – CS AI · May 7 · 7/10
🧠Researchers introduce SemGrad, a gradient-based uncertainty quantification method for large language models that operates in semantic space rather than parameter space, eliminating the computational overhead of sampling-based approaches. The method measures output stability under semantically equivalent input perturbations to gauge LLM confidence, addressing the critical challenge of hallucinations in free-form text generation.
AI · Bullish · arXiv – CS AI · Apr 20 · 7/10
🧠Researchers introduce Sequential Internal Variance Representation (SIVR), a novel supervised framework for detecting hallucinations in large language models by analyzing token-wise and layer-wise variance patterns in hidden states. The method demonstrates superior generalization compared to existing approaches while requiring smaller training datasets, potentially enabling practical deployment of hallucination detection systems.
AI · Neutral · arXiv – CS AI · Apr 15 · 7/10
🧠Researchers introduce VLM-DeflectionBench, a new benchmark with 2,775 samples designed to evaluate how large vision-language models handle conflicting or insufficient evidence. The study reveals that most state-of-the-art LVLMs fail to appropriately deflect when faced with noisy or misleading information, highlighting critical gaps in model reliability for knowledge-intensive tasks.
AI · Neutral · arXiv – CS AI · Apr 10 · 7/10
🧠Researchers demonstrate that standard LLM-as-a-judge methods achieve only 52% accuracy in detecting hallucinations and omissions in mental health chatbots, failing in high-risk healthcare contexts. A hybrid framework combining human domain expertise with machine learning features achieves significantly higher performance (0.717-0.849 F1 scores), suggesting that transparent, interpretable approaches outperform black-box LLM evaluation in safety-critical applications.
AI · Bullish · arXiv – CS AI · Apr 10 · 7/10
🧠Researchers developed a weak supervision framework to detect hallucinations in large language models by distilling grounding signals into transformer representations during training. Using substring matching, sentence embeddings, and LLM judges, they created a 15,000-sample dataset and trained five probing classifiers that achieve hallucination detection from internal activations alone at inference time, eliminating the need for external verification systems.
AI · Bearish · arXiv – CS AI · Apr 7 · 7/10
🧠Researchers present a new framework for AI safety that identifies a 57-token predictive window for detecting potential failures in large language models. The study found that only one out of seven tested models showed predictive signals before committing to problematic outputs, while factual hallucinations produced no detectable warning signs.
AI · Bullish · arXiv – CS AI · Apr 7 · 7/10
🧠Researchers developed an LLM-powered evolutionary search method to automatically design uncertainty quantification systems for large language models, achieving up to 6.7% improvement in performance over manual designs. The study found that different AI models employ distinct evolutionary strategies, with some favoring complex linear estimators while others prefer simpler positional weighting approaches.
🧠 Claude · 🧠 Sonnet · 🧠 Opus
AI · Bullish · arXiv – CS AI · Mar 26 · 7/10
🧠Researchers developed SCoOP, a training-free framework that combines multiple Vision-Language Models to improve uncertainty quantification and reduce hallucinations in AI systems. The method achieves 10-13% better hallucination detection performance compared to existing approaches while adding only microsecond-level overhead to processing time.
AI · Neutral · arXiv – CS AI · Mar 12 · 7/10
🧠Researchers introduce TRACED, a framework that evaluates AI reasoning quality through geometric analysis rather than traditional scalar probabilities. The system identifies correct reasoning as high-progress stable trajectories, while AI hallucinations show low-progress unstable patterns with high curvature fluctuations.
AI × Crypto · Neutral · arXiv – CS AI · Mar 12 · 7/10
🤖Researchers propose NabaOS, a lightweight verification framework that detects AI agent hallucinations using HMAC-signed tool receipts instead of zero-knowledge proofs. The system achieves 94.2% detection accuracy with <15ms verification time, compared to cryptographic approaches that require 180+ seconds per query.
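The general idea of an HMAC-signed tool receipt can be sketched with Python's standard `hmac` module. This is an illustration of the technique only, not NabaOS's actual receipt format or API: the key, the field names, and the helpers `sign_receipt`, `verify_receipt`, and `claim_is_backed` are all hypothetical. The runtime signs each real tool result, and an agent's claim about a tool output counts as grounded only if a receipt with a valid tag backs it.

```python
import hashlib
import hmac
import json
from typing import Dict, List

# Hypothetical demo key; a real deployment would keep this out of the agent's reach.
SECRET_KEY = b"demo-key-not-for-production"


def _payload(tool_name: str, tool_output: str) -> bytes:
    """Canonical byte encoding of the fields covered by the signature."""
    return json.dumps(
        {"tool": tool_name, "output": tool_output}, sort_keys=True
    ).encode("utf-8")


def sign_receipt(tool_name: str, tool_output: str, key: bytes = SECRET_KEY) -> Dict[str, str]:
    """Issued by the tool runtime each time a tool actually runs."""
    tag = hmac.new(key, _payload(tool_name, tool_output), hashlib.sha256).hexdigest()
    return {"tool": tool_name, "output": tool_output, "tag": tag}


def verify_receipt(receipt: Dict[str, str], key: bytes = SECRET_KEY) -> bool:
    """Recompute the tag and compare in constant time."""
    expected = hmac.new(
        key, _payload(receipt["tool"], receipt["output"]), hashlib.sha256
    ).hexdigest()
    return hmac.compare_digest(expected, receipt["tag"])


def claim_is_backed(claimed_output: str, receipts: List[Dict[str, str]],
                    key: bytes = SECRET_KEY) -> bool:
    """A claimed tool output is grounded only if a valid receipt contains it."""
    return any(
        verify_receipt(r, key) and r["output"] == claimed_output for r in receipts
    )


# Toy example: one genuine receipt and one with the output altered after signing.
genuine = sign_receipt("weather_lookup", "12C, overcast")
tampered = dict(genuine, output="30C, sunny")
```

Verification here is a single SHA-256 HMAC over a short payload, which is why this style of check can be far cheaper than generating a zero-knowledge proof. The trade-off is that the verifier must share the secret key with the signer.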