#llm-reliability News & Analysis

56 articles tagged with #llm-reliability. AI-curated summaries with sentiment analysis and key takeaways from 50+ sources.

AI · Bullish · arXiv – CS AI · Jun 23 · 7/10

Peeking Inside LLMs: Leveraging Internal Artifacts of LLMs for Enhancing Reliability in Legal Classification

Researchers demonstrate that internal computational artifacts within Large Language Models can reliably detect when the model produces incorrect outputs in legal classification tasks. By analyzing these internal signals, downstream classifiers can identify hallucinated or erroneous predictions, potentially improving the reliability of LLM-based legal systems for high-stakes applications like bail decisions and statute violation predictions.

AI · Bearish · arXiv – CS AI · Jun 23 · 7/10

The Chameleon Nature of LLMs: Quantifying Multi-Turn Stance Instability in Search-Enabled Language Models

Researchers have identified "chameleon behavior" in search-enabled large language models, where they inconsistently shift stances when presented with contradictory questions in multi-turn conversations. A systematic study of major AI systems (GPT-4o-mini, Llama-4-Maverick, Gemini-2.5-Flash) reveals severe stance instability scores (0.391-0.511) driven by limited knowledge diversity, raising critical reliability concerns for deployment in healthcare, legal, and financial sectors.

🧠 GPT-4 · 🧠 Gemini · 🧠 Llama

AI · Neutral · arXiv – CS AI · Jun 19 · 7/10

A Systematic Evaluation of Black-Box Uncertainty Estimation Methods for Large Language Models

Researchers present a comprehensive evaluation framework for black-box uncertainty estimation methods in large language models, benchmarking 24 methods across 4 models and datasets. The study reveals that no single approach dominates universally, but hybrid methods combining multiple uncertainty signals and candidate-reasoning approaches consistently outperform others, addressing critical gaps in trustworthy LLM deployment.

AI · Neutral · arXiv – CS AI · Jun 19 · 7/10

LLM Doesn't Know What It Doesn't Know: Detecting Epistemic Blind Spots via Cross-Model Attribution Divergence on Clinical Tabular Data

Researchers demonstrate that Large Language Models lack genuine self-awareness regarding their knowledge limitations when applied to clinical tabular data, using cross-model attribution divergence to detect epistemic blind spots. LLM confidence scores remain constant regardless of actual accuracy, while a novel cross-model calibrator achieves reliable uncertainty quantification without model access or retraining.

AI · Bullish · arXiv – CS AI · Jun 10 · 7/10

From Context-Aware to Conflict-Aware: Generalizing Contrastive Decoding for Knowledge Conflict in LLMs

Researchers propose a conflict-aware paradigm for large language models that dynamically balances external context against parametric knowledge, addressing failures in existing contrastive decoding methods. The work introduces Adaptive Regime Routing (ARR) to resolve fundamental asymmetries in how models handle contradictory information, improving resistance to erroneous context by 3-5x while maintaining performance on correct context.
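
The plain context-aware contrastive decoding step that this line of work starts from can be sketched as below. This is a minimal illustration of the generic formula (amplify the logits computed with the context, subtract the logits computed without it), not the paper's Adaptive Regime Routing. The function names, the `alpha` weight, and the toy logits are assumptions made for the example.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def context_aware_logits(with_context, without_context, alpha=1.0):
    """Contrast the two logit vectors token by token.

    adjusted = (1 + alpha) * logits_with_context - alpha * logits_without_context
    A larger alpha pushes the model harder toward what the context supports.
    """
    return [(1 + alpha) * c - alpha * p
            for c, p in zip(with_context, without_context)]


# Hypothetical logits for a three-token vocabulary.
with_context = [2.0, 1.0, 0.0]     # model conditioned on retrieved context
without_context = [0.0, 1.0, 2.0]  # model relying on parametric knowledge only

adjusted = context_aware_logits(with_context, without_context, alpha=1.0)
probs = softmax(adjusted)
```

With a fixed `alpha`, the same boost applies whether the context is right or wrong, which is the kind of asymmetry a conflict-aware method would need to handle adaptively.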

AI · Bullish · arXiv – CS AI · Jun 10 · 7/10

Integrating Local and Global Entropy for Uncertainty Quantification in LLMs

Researchers propose Global-Local Uncertainty (GLU), a new method for quantifying uncertainty in large language models by combining hidden-state geometric entropy with token-level signals. The approach successfully identifies confident-but-wrong predictions that existing token-only methods miss, offering improved reliability assessment across multiple model families.

AI · Bullish · arXiv – CS AI · Jun 9 · 7/10

Inference-Time Conformal Reasoning with Valid Factuality Control for Large Language Models

Researchers propose Inference-Time Conformal Reasoning (ITCR), a framework that integrates conformal prediction directly into LLM reasoning generation to provide mathematically valid factuality guarantees. The method addresses the structural nature of uncertainty in multi-step reasoning by calibrating when to stop generation based on graph-level factuality signals, delivering more accurate outputs than post-hoc correction approaches.
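
For readers unfamiliar with conformal prediction, the calibration step behind such guarantees can be sketched with ordinary split conformal prediction. This is a generic illustration, not ITCR's graph-level procedure. The nonconformity scores, the `alpha` level, and the function name are hypothetical.

```python
import math


def conformal_threshold(calibration_scores, alpha):
    """Split-conformal quantile of nonconformity scores.

    Returns the ceil((n + 1) * (1 - alpha))-th smallest calibration score.
    Accepting new items whose score is at or below this threshold gives
    coverage of at least 1 - alpha, assuming exchangeable data.
    """
    n = len(calibration_scores)
    rank = math.ceil((n + 1) * (1 - alpha))
    if rank > n:
        # Too few calibration points to certify this level.
        return float("inf")
    return sorted(calibration_scores)[rank - 1]


# Hypothetical nonconformity scores from a held-out calibration set
# (higher means the claim looked less reliable).
scores = [0.05, 0.10, 0.20, 0.30, 0.40, 0.60, 0.80]
threshold = conformal_threshold(scores, alpha=0.25)


def accept(score):
    """Keep a new claim only if it is no stranger than the threshold."""
    return score <= threshold
```

The guarantee comes from the calibration set rather than from the model, so it holds only as long as calibration and deployment data are exchangeable.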

AI · Bullish · arXiv – CS AI · Jun 9 · 7/10

FASE: Fast Adaptive Semantic Entropy for Code Quality

Researchers introduce FASE (Fast Adaptive Semantic Entropy), a novel metric for evaluating code quality in multi-agent AI systems that reduces computational costs by 99.7% while improving accuracy by 25% compared to existing semantic entropy methods. The approach uses structural and semantic dissimilarity graphs instead of expensive LLM-driven equivalence checks, offering practical uncertainty quantification for autonomous software development.
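
As background, the baseline semantic entropy idea can be sketched as follows: sample several outputs, group the ones that mean the same thing, and take the entropy of the group frequencies. This is a simplified generic version, not FASE itself. The whitespace-based equivalence check is a stand-in assumption for whatever equivalence test a real system would use.

```python
import math


def semantic_entropy(samples, equivalent):
    """Entropy over clusters of mutually equivalent samples.

    Each sample joins the first cluster whose representative it matches,
    otherwise it starts a new cluster.
    """
    clusters = []
    for sample in samples:
        for cluster in clusters:
            if equivalent(sample, cluster[0]):
                cluster.append(sample)
                break
        else:
            clusters.append([sample])
    n = len(samples)
    return -sum((len(c) / n) * math.log(len(c) / n) for c in clusters)


def same_after_normalizing(a, b):
    """Toy equivalence check: identical once all whitespace is removed."""
    return "".join(a.split()) == "".join(b.split())


# Hypothetical code samples generated for the same prompt.
samples = ["return a+b", "return a + b", "return b - a", "return a+b"]
entropy = semantic_entropy(samples, same_after_normalizing)
```

Low entropy means the samples agree; high entropy flags prompts where the model's outputs diverge in meaning. The cost of the equivalence check dominates in practice, which is the part cheaper methods aim to replace.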

AI · Bullish · arXiv – CS AI · Jun 8 · 7/10

MACD: Model-Aware Contrastive Decoding via Counterfactual Data

Researchers introduce MACD, a new inference strategy that reduces hallucinations in video language models by using the model's own feedback to identify problematic visual regions and generate targeted counterfactual data. The method combines model-aware object-level modifications with contrastive decoding, showing consistent improvements across multiple benchmarks and video-LLM architectures.

AI · Bearish · arXiv – CS AI · Jun 5 · 7/10

Trust, but Don't Verify: Epistemic Blind Spots in LLM Source Evaluation

A new study reveals that large language models can identify fabricated statistics in isolation but fail to apply this capability when synthesizing multiple sources, instead weighting sources based on analytical presentation style rather than numeric validity. This 'epistemic alignment' failure—where models prioritize how credible something sounds over whether it's actually true—persists across multiple model families and domains, with attempted fixes through prompting producing blanket skepticism rather than selective discernment.

🧠 Claude

AI · Bearish · arXiv – CS AI · Jun 4 · 7/10

The Invisible Lottery: How Subtle Cues Steer Algorithm Choice in LLM Code Generation

Researchers discovered that incidental contextual cues in prompts systematically steer LLM code generation toward different algorithms, even when all outputs are functionally correct. Across 46,535 experiments, subtle variations in wording and metadata produced algorithm-choice shifts up to 100 percentage points, creating unpredictable performance and security outcomes in production code.

AI · Bullish · arXiv – CS AI · Jun 2 · 7/10

Self-Healing Agentic Orchestrators for Reliable Tool-Augmented Large Language Model Systems

Researchers present a self-healing orchestration framework for tool-augmented large language models that treats reliability as a bounded runtime control problem, achieving 98.8% task success by mapping failure signals to recovery actions and verifying results. The approach outperforms retry-only and full-replanning baselines across multiple benchmarks, particularly excelling when recovery budgets are constrained.

AI · Bearish · arXiv – CS AI · Jun 2 · 7/10

On the Limits of LLM Adaptability: Impact of Model-Internalized Priors on Annotation Task Performance

Researchers demonstrate that Large Language Models exhibit significant limitations in zero-shot annotation tasks, with only 34.8% of initial errors correctable through prompting. The study finds that model-internalized priors and concept definitions influence LLM performance more strongly than text-level memorization, highlighting fundamental constraints on LLM adaptability for reliable AI-as-a-judge applications.

AI · Neutral · arXiv – CS AI · Jun 1 · 7/10

EHRBench: An Automated and Reliable EHR-based Benchmark for Clinical Decision Making with LLMs

Researchers introduce EHRBench, an automated benchmark containing nearly 1 million QA items derived from real patient electronic health records to evaluate large language models on clinical decision-making tasks. The framework combines LLM-based template generation with knowledge-base verification to assess model performance on diagnosis, treatment, and prognosis at scale while maintaining reliability.

AI · Bullish · arXiv – CS AI · Jun 1 · 7/10

Fighting Numerical Hallucinations via Data-centric Compilation for Online Financial QA

Researchers propose DCRC, a data-centric framework addressing numerical hallucinations in LLM-based financial question-answering systems. The approach combines adversarial data construction, multi-stage training, and executable reasoning programs to improve reliability in high-stakes financial applications where accuracy is critical.

AI · Bearish · Ars Technica – AI · May 28 · 7/10

LLMs believe false statements even after explicit warnings that they're false

Research demonstrates that large language models persistently represent false statements as true even after explicit corrections, exhibiting a systematic bias toward confident affirmation regardless of accuracy. This finding reveals a fundamental vulnerability in LLM reliability that has implications for applications requiring factual precision.

AI · Neutral · arXiv – CS AI · May 28 · 7/10

Towards Faithful Agentic XAI: A Verification Method and an Open-World Benchmark for Better Model Faithfulness

Researchers propose Faithful Agentic XAI (FAX), a framework that improves the reliability of AI explanations generated by large language models through explicit verification mechanisms. The study introduces CRAFTER-XAI-Bench, a new benchmark for testing explanation faithfulness in complex environments, demonstrating that current XAI systems can produce plausible but inaccurate explanations that mislead users.

AI · Bearish · arXiv – CS AI · May 12 · 7/10

The Geometry of Forgetting: Temporal Knowledge Drift as an Independent Axis in LLM Representations

Researchers demonstrate that large language models encode temporal knowledge drift—whether facts have become outdated since training—as a geometrically orthogonal direction in their internal representations, separate from correctness and uncertainty signals. This structural property explains why existing detection methods fail and why LLMs confidently produce outdated information, with implications for AI reliability and deployment.

AI · Neutral · arXiv – CS AI · May 12 · 7/10

The Trap of Trajectory: Towards Understanding and Mitigating Spurious Correlations in Agentic Memory

Researchers identify a critical vulnerability in agentic memory systems where Large Language Models retrieve and amplify spurious correlations from stored information, leading to erroneous reasoning in downstream decisions. The study benchmarks this risk and proposes CAMEL, a lightweight calibration method that mitigates spurious pattern reliance while maintaining performance on clean data.

AI · Bearish · arXiv – CS AI · May 12 · 7/10

Causal Stories from Sensor Traces: Auditing Epistemic Overreach in LLM-Generated Personal Sensing Explanations

Researchers identified epistemic overreach in LLM-generated explanations of personal sensing data, where AI systems produce coherent-sounding narratives about anomalous days without sufficient evidentiary support. Testing 14,922 explanations across three LLM families revealed that models routinely attribute causes without data justification, and this problem persists even when provided richer context or explicit instructions to constrain claims.

🧠 Llama

AI · Bullish · arXiv – CS AI · May 11 · 7/10

A Self-Healing Framework for Reliable LLM-Based Autonomous Agents

Researchers propose a self-healing framework for LLM-based autonomous agents that addresses critical reliability issues including hallucinations, execution errors, and reasoning inconsistencies. The framework combines failure detection, reliability assessment, and automated recovery mechanisms, demonstrating significant improvements in task success rates and system robustness in multi-agent environments.

AI · Bullish · arXiv – CS AI · Apr 14 · 7/10

CircuitSynth: Reliable Synthetic Data Generation

CircuitSynth is a neuro-symbolic framework that addresses hallucinations and logical inconsistencies in LLM-generated synthetic data by combining probabilistic decision diagrams with optimization mechanisms to enforce hard constraints and distributional guarantees. The approach achieves 100% schema validity across complex benchmarks while outperforming existing methods in coverage, representing a significant advancement in reliable synthetic data generation for machine learning applications.

AI · Bearish · arXiv – CS AI · Apr 10 · 7/10

Daily and Weekly Periodicity in Large Language Model Performance and Its Implications for Research

Researchers discovered that GPT-4o exhibits significant daily and weekly performance fluctuations when solving identical tasks under fixed conditions, with periodic variability accounting for approximately 20% of total variance. This finding fundamentally challenges the widespread assumption that LLM performance is time-invariant and raises critical concerns about the reliability and reproducibility of research utilizing large language models.

🧠 GPT-4

AI · Bullish · arXiv – CS AI · Apr 10 · 7/10

Scientific Knowledge-driven Decoding Constraints Improving the Reliability of LLMs

Researchers propose SciDC, a method that constrains large language model outputs using subject-specific scientific rules to reduce hallucinations and improve reliability. The approach demonstrates 12% average accuracy improvements across domain tasks including drug formulation, clinical diagnosis, and chemical synthesis planning.

AI · Bullish · arXiv – CS AI · Apr 10 · 7/10

Weakly Supervised Distillation of Hallucination Signals into Transformer Representations

Researchers developed a weak supervision framework to detect hallucinations in large language models by distilling grounding signals into transformer representations during training. Using substring matching, sentence embeddings, and LLM judges as labeling signals, they created a 15,000-sample dataset and trained five probing classifiers that detect hallucinations from internal activations alone at inference time, eliminating the need for external verification systems.