AI · Bullish · arXiv – CS AI · 4h ago · 7/10
🧠Researchers introduce MACD, a new inference strategy that reduces hallucinations in video language models by using the model's own feedback to identify problematic visual regions and generate targeted counterfactual data. The method combines model-aware object-level modifications with contrastive decoding, showing consistent improvements across multiple benchmarks and video-LLM architectures.
AI · Bearish · arXiv – CS AI · 3d ago · 7/10
🧠A new study reveals that large language models can identify fabricated statistics in isolation but fail to apply this capability when synthesizing multiple sources, instead weighting sources based on analytical presentation style rather than numeric validity. This 'epistemic alignment' failure—where models prioritize how credible something sounds over whether it's actually true—persists across multiple model families and domains, with attempted fixes through prompting producing blanket skepticism rather than selective discernment.
🧠 Claude
AI · Bearish · arXiv – CS AI · 4d ago · 7/10
🧠Researchers discovered that incidental contextual cues in prompts systematically steer LLM code generation toward different algorithms, even when all outputs are functionally correct. Across 46,535 experiments, subtle variations in wording and metadata produced algorithm-choice shifts up to 100 percentage points, creating unpredictable performance and security outcomes in production code.
AI · Bearish · arXiv – CS AI · 6d ago · 7/10
🧠Researchers demonstrate that Large Language Models exhibit significant limitations in zero-shot annotation tasks, with only 34.8% of initial errors correctable through prompting. The study reveals that model-internalized priors and concept definitions influence LLM performance more strongly than text-level memorization, highlighting fundamental constraints on LLM adaptability for reliable AI-as-a-judge applications.
AI · Bullish · arXiv – CS AI · 6d ago · 7/10
🧠Researchers present a self-healing orchestration framework for tool-augmented large language models that treats reliability as a bounded runtime control problem, achieving 98.8% task success by mapping failure signals to recovery actions and verifying results. The approach outperforms retry-only and full-replanning baselines across multiple benchmarks, particularly excelling when recovery budgets are constrained.
AI · Bullish · arXiv – CS AI · Jun 1 · 7/10
🧠Researchers propose DCRC, a data-centric framework addressing numerical hallucinations in LLM-based financial question-answering systems. The approach combines adversarial data construction, multi-stage training, and executable reasoning programs to improve reliability in high-stakes financial applications where accuracy is critical.
AI · Neutral · arXiv – CS AI · Jun 1 · 7/10
🧠Researchers introduce EHRBench, an automated benchmark containing nearly 1 million QA items derived from real patient electronic health records to evaluate large language models on clinical decision-making tasks. The framework combines LLM-based template generation with knowledge-base verification to assess model performance on diagnosis, treatment, and prognosis at scale while maintaining reliability.
AI · Bearish · Ars Technica – AI · May 28 · 7/10
🧠Research demonstrates that large language models persistently represent false statements as true even after explicit corrections, exhibiting a systematic bias toward confident affirmation regardless of accuracy. This finding reveals a fundamental vulnerability in LLM reliability that has implications for applications requiring factual precision.
AI · Neutral · arXiv – CS AI · May 28 · 7/10
🧠Researchers propose Faithful Agentic XAI (FAX), a framework that improves the reliability of AI explanations generated by large language models through explicit verification mechanisms. The study introduces CRAFTER-XAI-Bench, a new benchmark for testing explanation faithfulness in complex environments, demonstrating that current XAI systems can produce plausible but inaccurate explanations that mislead users.
AI · Bearish · arXiv – CS AI · May 12 · 7/10
🧠Researchers demonstrate that large language models encode temporal knowledge drift—whether facts have become outdated since training—as a geometrically orthogonal direction in their internal representations, separate from correctness and uncertainty signals. This structural property explains why existing detection methods fail and why LLMs confidently produce outdated information, with implications for AI reliability and deployment.
AI · Bearish · arXiv – CS AI · May 12 · 7/10
🧠Researchers identified epistemic overreach in LLM-generated explanations of personal sensing data, where AI systems produce coherent-sounding narratives about anomalous days without sufficient evidentiary support. Testing 14,922 explanations across three LLM families revealed that models routinely attribute causes without data justification, and that this problem persists even when the models are given richer context or explicit instructions to constrain their claims.
🧠 Llama
AI · Neutral · arXiv – CS AI · May 12 · 7/10
🧠Researchers identify a critical vulnerability in agentic memory systems where Large Language Models retrieve and amplify spurious correlations from stored information, leading to erroneous reasoning in downstream decisions. The study benchmarks this risk and proposes CAMEL, a lightweight calibration method that mitigates spurious pattern reliance while maintaining performance on clean data.
AI · Bullish · arXiv – CS AI · May 11 · 7/10
🧠Researchers propose a self-healing framework for LLM-based autonomous agents that addresses critical reliability issues including hallucinations, execution errors, and reasoning inconsistencies. The framework combines failure detection, reliability assessment, and automated recovery mechanisms, demonstrating significant improvements in task success rates and system robustness in multi-agent environments.
AI · Bullish · arXiv – CS AI · Apr 14 · 7/10
🧠CircuitSynth is a neuro-symbolic framework that addresses hallucinations and logical inconsistencies in LLM-generated synthetic data by combining probabilistic decision diagrams with optimization mechanisms to enforce hard constraints and distributional guarantees. The approach achieves 100% schema validity across complex benchmarks while outperforming existing methods in coverage, representing a significant advancement in reliable synthetic data generation for machine learning applications.
AI · Bearish · arXiv – CS AI · Apr 10 · 7/10
🧠Researchers discovered that GPT-4o exhibits significant daily and weekly performance fluctuations when solving identical tasks under fixed conditions, with periodic variability accounting for approximately 20% of total variance. This finding fundamentally challenges the widespread assumption that LLM performance is time-invariant and raises critical concerns about the reliability and reproducibility of research utilizing large language models.
🧠 GPT-4
AI · Bullish · arXiv – CS AI · Apr 10 · 7/10
🧠Researchers propose SciDC, a method that constrains large language model outputs using subject-specific scientific rules to reduce hallucinations and improve reliability. The approach demonstrates an average accuracy improvement of 12% across domain tasks including drug formulation, clinical diagnosis, and chemical synthesis planning.
AI · Bullish · arXiv – CS AI · Apr 10 · 7/10
🧠Researchers developed a weak supervision framework to detect hallucinations in large language models by distilling grounding signals into transformer representations during training. Using substring matching, sentence embeddings, and LLM judges, they created a 15,000-sample dataset and trained five probing classifiers that achieve hallucination detection from internal activations alone at inference time, eliminating the need for external verification systems.
AI · Bullish · arXiv – CS AI · Mar 3 · 7/10
🧠Researchers introduce HalluGuard, a new framework that identifies and addresses both data-driven and reasoning-driven hallucinations in Large Language Models. The system achieved state-of-the-art performance across 10 benchmarks and 9 LLM backbones, offering a unified approach to improve AI reliability in critical domains like healthcare and law.
AI · Bullish · arXiv – CS AI · 3d ago · 6/10
🧠Researchers introduce a critic-guided multi-agent framework that improves LLM reasoning reliability for mathematical problem-solving by combining heterogeneous AI agents with adaptive feedback loops. The approach achieves a 13% accuracy improvement on benchmarks while demonstrating that smaller models can match larger ones when equipped with critique mechanisms.
AI · Neutral · arXiv – CS AI · 3d ago · 6/10
🧠Researchers have created the first empirical taxonomy of runtime faults in Model Context Protocol (MCP) servers, identifying 73 distinct fault types across 11 categories after analyzing 837 fault threads from 473 GitHub repositories. The study reveals that configuration parameters accepted but not enforced at runtime cause widespread reliability issues in LLM tool-augmentation workflows, with developer surveys confirming that these faults are commonly experienced across the industry.
AI · Neutral · arXiv – CS AI · 6d ago · 6/10
🧠Researchers present a method to improve error prediction in Large Language Models by distinguishing between genuine model uncertainty and input ambiguity. Using uncertainty quantification metrics on question-answering tasks, they demonstrate that ambiguity information significantly enhances error prediction accuracy, yielding improvements exceeding 10 percentage points across multiple datasets and model families.
AI · Neutral · arXiv – CS AI · 6d ago · 6/10
🧠Researchers present LLMFI, a fault-injection framework that systematically studies how hardware errors propagate through large language model inference across multiple domains. The study identifies critical vulnerability patterns and proposes four software-only reliability improvements, providing practical guidance for deploying LLMs in high-performance computing environments.
AI · Neutral · arXiv – CS AI · Jun 1 · 6/10
🧠Researchers formalize a theoretical framework distinguishing between universal LLM reliability (impossible across unbounded domains) and patch-local reliability (achievable within operationally bounded systems). The work proposes that deployed AI systems can achieve practical reliability by focusing on recurring failure modes within specific contexts rather than attempting universal solutions.
AI · Neutral · arXiv – CS AI · May 28 · 6/10
🧠Researchers present CODE, a novel approach to knowledge editing in large language models that replaces fact overwriting with causal reasoning. By embedding causal narratives and on-policy distillation into model parameters, CODE reduces self-refutation rates from 95.6% to 1.8%, enabling LLMs to evolve knowledge coherently rather than storing isolated facts.
AI · Neutral · arXiv – CS AI · May 28 · 6/10
🧠Researchers propose Sequential Bayesian Belief Tracking (SBBT), a framework for estimating the reliability of long reasoning chains in large language models before final answers are known. The study finds that probability calibration and ranking performance respond differently to various evidence types: scalar scores improve calibration metrics, while structural observations are needed for ranking tasks.