#llm-reliability News & Analysis

56 articles tagged with #llm-reliability. AI-curated summaries with sentiment analysis and key takeaways from 50+ sources.

AI · Bullish · arXiv – CS AI · Mar 3 · 7/10 · 4

HalluGuard: Demystifying Data-Driven and Reasoning-Driven Hallucinations in LLMs

Researchers introduce HalluGuard, a new framework that identifies and addresses both data-driven and reasoning-driven hallucinations in Large Language Models. The system achieved state-of-the-art performance across 10 benchmarks and 9 LLM backbones, offering a unified approach to improve AI reliability in critical domains like healthcare and law.

AI · Bullish · arXiv – CS AI · Jun 25 · 6/10

CausalRAG2: Hierarchical Causal Knowledge Graph Design for RAG

Researchers introduce CausalRAG2, a framework that improves retrieval-augmented generation (RAG) systems by incorporating causal reasoning into knowledge graph design, addressing limitations in current entity-centric approaches. The framework uses hierarchical modules with causal gating to reduce spurious correlations and enable scalable reasoning, accompanied by a new HolisQA benchmark for comprehensive evaluation.

AI · Neutral · arXiv – CS AI · Jun 23 · 6/10

The Origins of Stochasticity: Comprehensive Investigations on Uncertainty Quantification for Large Language Models

Researchers propose a comprehensive uncertainty quantification (UQ) framework for large language models, breaking down sources of error into input-level, parameter-level, token-level, and decoding-process components. Testing 21 UQ methods across Qwen3, Llama 3.2, and DeepSeek-V3 reveals that consensus-based approaches consistently outperform alternatives, while larger models exhibit lower uncertainty estimates according to an empirical scaling law.

🧠 Llama

AI · Neutral · arXiv – CS AI · Jun 23 · 6/10

Generalization of Fine-Tuned Uncertainty Communication and Metacognition in Large Language Models

Researchers demonstrate that large language models can be fine-tuned to improve uncertainty communication—aligning stated confidence with actual answer correctness—but gains don't reliably transfer across different confidence tasks or domains. Multitask training shows promise for broader generalization, addressing a critical reliability issue as LLMs are deployed in high-stakes settings.
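
One common way to measure how well stated confidence matches answer correctness is expected calibration error (ECE). The sketch below is a generic illustration, not the metric or code from the paper summarized above; the function name, the equal-width binning, and the default of 10 bins are assumptions made for this example.

```python
from typing import Sequence


def expected_calibration_error(
    confidences: Sequence[float],
    correct: Sequence[bool],
    n_bins: int = 10,
) -> float:
    """Weighted average gap between mean confidence and accuracy per bin.

    Hypothetical helper: confidences are assumed to lie in [0, 1].
    """
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    total = len(confidences)
    if total == 0:
        return 0.0

    # Group each (confidence, correctness) pair into an equal-width bin.
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        index = min(int(conf * n_bins), n_bins - 1)  # conf == 1.0 goes in the last bin
        bins[index].append((conf, ok))

    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_conf = sum(conf for conf, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        # Weight each bin's gap by the share of samples it holds.
        ece += (len(members) / total) * abs(mean_conf - accuracy)
    return ece
```

A model that states 90% confidence on four answers and gets three right has a gap of 0.15 under this measure; a lower value means stated confidence tracks correctness more closely.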

AI · Neutral · arXiv – CS AI · Jun 19 · 6/10

Trustworthy Multi-Agent Systems: Mitigating Semantic Drift with the Argent Signaling Protocol

Researchers introduce the Argent Signaling Protocol (ASP), a structured metadata framework that helps multi-agent AI systems distinguish between repairable failures and unrecoverable errors by tagging responses with quality signals including certainty, grounding, and stochasticity. Testing across multiple language models shows significant improvements in accuracy and error containment, with particular success in blocking ungrounded information from propagating through agent pipelines.

AI · Neutral · arXiv – CS AI · Jun 10 · 6/10

Decoupling Thought from Speech: Knowledge-Grounded Counterfactual Reasoning for Resilient Multi-Agent Argumentation

Researchers introduce Knowledge-Grounded Counterfactual Reasoning (KG-CFR), a dual-stage architecture that improves multi-agent debate systems by separating planning from execution, preventing logic degradation and argument repetition. In stress-tested simulations, KG-CFR maintains argument quality above 0.82 in 95% of perturbed scenarios, demonstrating that architectural decoupling enhances system resilience under sustained pressure.

AI · Bullish · arXiv – CS AI · Jun 9 · 6/10

Know More, Know Clearer: A Meta-Cognitive Framework for Knowledge Augmentation in Large Language Models

Researchers propose a meta-cognitive framework that improves Large Language Models by distinguishing between mastered knowledge, confused understanding, and missing information. The approach uses internal confidence signals to guide targeted knowledge augmentation and calibrate model certainty with actual accuracy, addressing a critical gap where LLMs often exhibit overconfidence despite knowledge deficiencies.

AI · Bullish · arXiv – CS AI · Jun 5 · 6/10

Critic-Guided Heterogeneous Multi-Agent Reasoning for Reliable Mathematical Problem Solving

Researchers introduce a critic-guided multi-agent framework that improves LLM reasoning reliability for mathematical problem-solving by combining heterogeneous AI agents with adaptive feedback loops. The approach achieves 13% accuracy improvements on benchmarks while demonstrating that smaller models can match larger ones when equipped with critique mechanisms.

AI · Neutral · arXiv – CS AI · Jun 5 · 6/10

A Taxonomy of Runtime Faults in Model Context Protocol Servers

Researchers have created the first empirical taxonomy of runtime faults in Model Context Protocol (MCP) servers, identifying 73 distinct fault types across 11 categories after analyzing 837 fault threads from 473 GitHub repositories. The study reveals that configuration parameters accepted but not enforced at runtime cause widespread reliability issues in LLM tool-augmentation workflows, with developer surveys confirming that these faults are commonly experienced across the industry.

AI · Neutral · arXiv – CS AI · Jun 2 · 6/10

The Role of Ambiguity in Error Prediction via Uncertainty Quantification

Researchers present a method to improve error prediction in Large Language Models by distinguishing between genuine model uncertainty and input ambiguity. Using uncertainty quantification metrics on question-answering tasks, they demonstrate that ambiguity information significantly enhances error prediction accuracy, yielding improvements exceeding 10 percentage points across multiple datasets and model families.

AI · Neutral · arXiv – CS AI · Jun 2 · 6/10

Not All Errors Are Equal: A Systematic Study of Error Propagation in Large Language Model Inference

Researchers present LLMFI, a fault-injection framework that systematically studies how hardware errors propagate through large language model inference across multiple domains. The study identifies critical vulnerability patterns and proposes four software-only reliability improvements, providing practical guidance for deploying LLMs in high-performance computing environments.

AI · Neutral · arXiv – CS AI · Jun 1 · 6/10

The Architecture of Errors: From Universal Impossibility to Patch-Local LLM Reliability

Researchers formalize a theoretical framework distinguishing between universal LLM reliability (impossible across unbounded domains) and patch-local reliability (achievable within operationally bounded systems). The work proposes that deployed AI systems can achieve practical reliability by focusing on recurring failure modes within specific contexts rather than attempting universal solutions.

AI · Neutral · arXiv – CS AI · May 28 · 6/10

Prefix-Safe Bayesian Belief Tracking for LLM Reasoning Reliability: Separating Calibration from Ranking

Researchers propose Sequential Bayesian Belief Tracking (SBBT), a framework for estimating the reliability of long reasoning chains in large language models before final answers are known. The study finds that probability calibration and ranking performance respond differently to various evidence types: scalar scores improve calibration metrics, while structural observations are needed for ranking tasks.

AI · Neutral · arXiv – CS AI · May 28 · 6/10

From Fact Overwriting to Knowledge Evolution: Causal Editing via On-Policy Self-Distillation

Researchers present CODE, a novel approach to knowledge editing in large language models that replaces fact overwriting with causal reasoning. By embedding causal narratives and on-policy distillation into model parameters, CODE reduces self-refutation rates from 95.6% to 1.8%, enabling LLMs to evolve knowledge coherently rather than storing isolated facts.

AI · Neutral · arXiv – CS AI · May 27 · 6/10

Which Changes Matter? Towards Trustworthy Legal AI via Relevance-Sensitive Evaluation and Solver-Grounded Reasoning

Researchers introduce LexGuard, an adversarial AI framework that improves legal reasoning in large language models by distinguishing legally relevant changes from irrelevant perturbations. The system uses formal logic and SMT solvers to ground legal decisions in statute interpretation, addressing systematic failures in existing legal AI systems to maintain appropriate sensitivity to material legal facts.

AI · Bullish · arXiv – CS AI · May 27 · 6/10

Traceable Knowledge Graph Reasoning Enables LLM-Assisted Decision Support for Industrial VOCs in the Steel Industry

Researchers developed Chat-ISV, an LLM-enhanced knowledge graph system that organizes fragmented steel industry VOCs literature into a queryable database with 27,180 nodes and 81,779 semantic edges. The system achieved 96.93% precision in answering specialized industrial questions, demonstrating a scalable approach to deploying reliable LLMs in domain-specific applications where hallucination risks are high.

AI · Bullish · arXiv – CS AI · May 27 · 6/10

SeDT: Sentence-Transformer Decision-Transformer Conditioning for Multi-Turn Conversation Reliability

Researchers present SeDT, a training-free method that improves large language model performance in multi-turn conversations by annotating conversation history with relevance scores, addressing a documented 39% performance drop when tasks are revealed incrementally across multiple turns.

AI · Neutral · arXiv – CS AI · May 27 · 6/10

ContextGuard: Structured Self-Auditing for Context Learning in Language Models

Researchers introduce ContextGuard, a self-auditing framework that addresses a critical gap in large language model performance: the inability to faithfully apply complex contextual knowledge despite strong reasoning capabilities. The system identifies and corrects failures where models miss peripheral, persistent, or format-sensitive requirements while following main reasoning paths.

AI · Neutral · arXiv – CS AI · May 27 · 6/10

Reliability and Effectiveness of Autonomous AI Agents in Supply Chain Management

Researchers demonstrate that autonomous AI agents can exceed human performance in supply chain management using the MIT Beer Game, yet reveal critical reliability issues including 'agent bullwhip'—amplified decision instability across multi-level systems. A reinforcement learning framework using Group Relative Policy Optimization successfully mitigates this instability and improves reliability.

AI · Neutral · arXiv – CS AI · May 27 · 6/10

How Reliable are LLMs for Reasoning on the Re-ranking task?

Researchers investigate whether Large Language Models reliably perform re-ranking tasks by analyzing how different training methods affect semantic understanding and reasoning transparency. The study reveals that some training approaches produce better explainability than others, suggesting LLMs may optimize for evaluation metrics rather than genuine semantic comprehension, raising concerns about their actual reliability in ranking applications.

AI · Neutral · arXiv – CS AI · May 12 · 6/10

NoisyCoconut: Counterfactual Consensus via Latent Space Reasoning

NoisyCoconut is an inference-time method that improves LLM reliability by injecting controlled noise into internal representations to generate diverse reasoning paths, enabling models to abstain when uncertain without requiring retraining. The technique reduces error rates from 40-70% to below 15% on mathematical reasoning tasks through unanimous agreement among noise-perturbed paths, offering practical reliability improvements compatible with existing models.
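
The agreement rule described above (answer only when every path agrees, otherwise abstain) can be sketched in a few lines. This is a minimal illustration of the unanimity check alone, assuming the final answers from each noise-perturbed path are already available as strings; it does not implement the latent-space noise injection, and the function name and the whitespace/case normalization are hypothetical choices for this example.

```python
from typing import Optional, Sequence


def unanimous_or_abstain(answers: Sequence[str]) -> Optional[str]:
    """Return the shared answer if all paths agree; return None to abstain.

    Hypothetical helper: answers are compared after trimming whitespace
    and lowercasing, which is an assumption, not the paper's procedure.
    """
    if not answers:
        return None  # no paths were run, so there is nothing to agree on
    normalized = {answer.strip().lower() for answer in answers}
    if len(normalized) == 1:
        return answers[0].strip()
    return None  # at least two paths disagree: abstain
```

Requiring unanimity trades coverage for precision: the stricter the agreement rule, the more questions are left unanswered, but the fewer wrong answers are returned.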

AI · Neutral · arXiv – CS AI · May 12 · 6/10

A Communication-Theoretic Framework for LLM Agents: Cost-Aware Adaptive Reliability

Researchers present a communication-theoretic framework that unifies LLM reliability techniques (retry, majority voting, self-consistency) under classical information theory, introducing a cost-aware router that achieves 56% lower costs than fixed approaches while maintaining quality. The work demonstrates that no single reliability technique dominates across all tasks, supporting dynamic per-task allocation strategies.

AI · Neutral · arXiv – CS AI · May 11 · 6/10

When Stored Evidence Stops Being Usable: Scale-Conditioned Evaluation of Agent Memory

Researchers present a scale-conditioned evaluation protocol for AI agent memory systems that tests whether stored evidence remains usable as irrelevant data accumulates. Testing across multiple memory architectures and language models reveals that reliability degrades unpredictably with scale, with some models exceeding computational budgets while others maintain performance, suggesting memory scalability claims must be conditioned on specific agent-interface-scale combinations.

AI · Neutral · arXiv – CS AI · May 11 · 6/10

Beyond Confidence: Rethinking Self-Assessments for Performance Prediction in LLMs

Researchers propose using multidimensional self-assessment based on cognitive appraisal theory to predict LLM failures more reliably than confidence alone. Testing across 12 models and 38 tasks, they find effort and ability dimensions consistently outperform confidence, with task type determining which dimension proves most predictive.

AI · Bullish · arXiv – CS AI · May 9 · 6/10

Towards Dependable Retrieval-Augmented Generation Using Factual Confidence Prediction

Researchers propose a two-stage approach to improve reliability in retrieval-augmented generation (RAG) systems by using conformal prediction to filter retrieved content and an attention-based classifier to detect factual inconsistencies. The framework achieves up to 6% answer quality improvement and 77% inconsistency detection, advancing toward certified RAG systems for production AI applications.