#model-reliability News & Analysis

54 articles tagged with #model-reliability. AI-curated summaries with sentiment analysis and key takeaways from 50+ sources.

AI · Neutral · arXiv – CS AI · Jun 25 · 7/10

Reclaim Evaluation: A Lossy Memory Is Worse Than an Empty One

Researchers demonstrate that language models with corrupted memory systems produce confident false answers, while models without memory abstain appropriately. A source-first compression strategy that preserves reasoning steps over conclusions restores correctability and prevents error propagation through chained interactions.

AI · Bearish · arXiv – CS AI · Jun 23 · 7/10

Plans Don't Persist: Why Context Management Is Load Bearing for LLM Agents

Researchers demonstrate that large language model agents fail to maintain plans as persistent internal state, instead relying on plans remaining in the context window. Using diagnostic techniques on Llama-3.1-70B and DeepSeek-R1, the study shows plan signal decays rapidly when compressed out of context, with practical implications for agent reliability in long-horizon tasks.

🧠 Llama

AI · Bearish · arXiv – CS AI · Jun 11 · 7/10

Can AI Agents Synthesize Scientific Conclusions?

Researchers introduced SciConBench, a benchmark evaluating AI agents' ability to synthesize scientific conclusions from systematic reviews. Testing eight frontier models and research agents under controlled conditions revealed fundamental limitations: the best-performing agent achieved a factual F1 score of only 0.337, and consumer-facing tools like Google AI Overview generated incomplete or contradictory conclusions even though ground-truth answers were available.

🏢 Google

AI · Bearish · arXiv – CS AI · Jun 10 · 7/10

τ-Rec: A Verifiable Benchmark for Agentic Recommender Systems

Researchers introduce τ-Rec, a new benchmark for evaluating conversational AI recommender systems that replaces subjective LLM-based judging with verifiable, measurable rewards. Testing across nine model configurations reveals a critical reliability gap, with even top-performing models achieving only ~57% accuracy on single-attempt tasks, exposing significant limitations in current agentic AI deployment.

🧠 GPT-5 · 🧠 Claude · 🧠 Sonnet

AI · Bearish · arXiv – CS AI · Jun 10 · 7/10

PhantomBench: Benchmarking the Non-existential Threat of Language Models

Researchers introduced PhantomBench, a large-scale benchmark containing over 60,000 non-existent terms and entities, to evaluate how well language models recognize the limits of their knowledge. Testing 21 models revealed alarming hallucination rates up to 86.7%, demonstrating that even frontier models fail to abstain from generating responses about concepts that don't exist.

AI · Bearish · arXiv – CS AI · Jun 10 · 7/10

AMEL: Accumulated Message Effects on LLM Judgments

Researchers discovered that large language models exhibit systematic bias in evaluations based on prior conversation history, with models shifting judgments toward the polarity of preceding items. The effect persists across 12 models from major providers and is stronger for uncertain cases and negative histories, raising concerns for applications relying on LLM-based automated evaluation.

🏢 OpenAI · 🏢 Anthropic · 🧠 GPT-5

AI · Bearish · arXiv – CS AI · Jun 4 · 7/10

Lost in Fog: Sensor Perturbations Expose Reasoning Fragility in Driving VLAs

Researchers evaluated Vision-Language-Action models in autonomous driving under sensor degradation, finding that explanation consistency (Chain-of-Causation) strongly correlates with trajectory reliability. When model explanations change due to perturbations like fog or noise, trajectory errors increase 5.3x, suggesting reasoning consistency could serve as a safety monitoring tool for autonomous vehicles.

AI · Bullish · arXiv – CS AI · Jun 2 · 7/10

Mitigating Hallucinations in Large Language Models Via Decoder Layer Skipping

Researchers introduce DeLask, a novel decoding framework that reduces hallucinations in Large Language Models by dynamically skipping decoder layers prone to generating false information. The method uses gradient-based analysis to identify problematic layers and partially aggregates their hidden states, demonstrating consistent improvements across diverse LLMs without requiring model retraining.

AI · Bullish · arXiv – CS AI · Jun 2 · 7/10

TriLens: Per-Layer Logit-Lens Entropy for White-Box Hallucination Detection

TriLens is a novel white-box detection method that identifies hallucinations in language models by tracking entropy changes across internal computational layers. Rather than examining only final outputs, the technique monitors uncertainty signals from multi-head attention, feed-forward networks, and residual streams using logit lens analysis, creating a compact 3L-dimensional trajectory that reveals how model confidence settles during inference.
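To make the "3L-dimensional trajectory" concrete, here is a minimal sketch of the general idea, not the paper's implementation. It assumes that each of the L layers has already been projected to vocabulary logits for three streams (attention output, feed-forward output, residual stream); the function names and the input layout are hypothetical.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def entropy(logits):
    """Shannon entropy (in nats) of the softmax distribution over logits."""
    return -sum(p * math.log(p) for p in softmax(logits) if p > 0.0)

def entropy_trajectory(layers):
    """Map L triples (attn_logits, ffn_logits, resid_logits) to 3L entropies.

    Assumption: the logit-lens projection of each stream has already been
    computed upstream; this function only summarizes uncertainty per layer.
    """
    trajectory = []
    for attn_logits, ffn_logits, resid_logits in layers:
        trajectory.extend(
            [entropy(attn_logits), entropy(ffn_logits), entropy(resid_logits)]
        )
    return trajectory

# Two layers over a 4-token vocabulary: uniform first, then increasingly peaked.
layers = [
    ([0.0, 0.0, 0.0, 0.0], [0.0, 0.0, 0.0, 0.0], [0.0, 0.0, 0.0, 0.0]),
    ([10.0, 0.0, 0.0, 0.0], [10.0, 0.0, 0.0, 0.0], [20.0, 0.0, 0.0, 0.0]),
]
trajectory = entropy_trajectory(layers)
```

A detector along these lines would feed the resulting vector to a small classifier; falling entropy across layers is the "confidence settling" signal the summary refers to.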

AI · Bullish · arXiv – CS AI · May 27 · 7/10

PaTAS: A Framework for Trust Propagation in Neural Networks Using Subjective Logic

Researchers introduce PaTAS (Parallel Trust Assessment System), a framework that uses Subjective Logic to measure and propagate trust through neural networks alongside standard computation. The system identifies reliability gaps and adversarial vulnerabilities that traditional metrics like accuracy fail to detect, offering a foundation for deploying AI safely in critical applications.

AI · Bearish · arXiv – CS AI · May 12 · 7/10

Political Plasticity: An Analysis of Ideological Adaptability in Large Language Models

Researchers developed a testing framework to study "political plasticity"—how Large Language Models adapt their ideological responses based on user context. The study found that newer, larger LLMs reliably shift responses along economic and personal freedom axes when prompted with few-shot examples, while older models show limited adaptability, raising concerns about potential data leakage and model reliability.

AI · Neutral · arXiv – CS AI · May 12 · 7/10

Where Reliability Lives in Vision-Language Models: A Mechanistic Study of Attention, Hidden States, and Causal Circuits

Researchers challenge the widespread assumption that sharp attention maps in vision-language models indicate reliable outputs. Through mechanistic analysis of three VLM families (LLaVA, PaliGemma, Qwen2-VL), they find attention structure is nearly uncorrelated with correctness, while hidden-state geometry and late-layer circuits prove far more predictive of model reliability.

AI · Bullish · arXiv – CS AI · May 11 · 7/10

Confidence-Aware Alignment Makes Reasoning LLMs More Reliable

Researchers introduce CASPO, a framework that improves reasoning reliability in large language models by aligning token-level confidence with step-wise logical correctness through preference optimization. The method achieves better performance than tree-search approaches without requiring separate reward models, while introducing CaT inference that dynamically prunes uncertain reasoning branches with minimal computational overhead.

AI · Bullish · arXiv – CS AI · May 7 · 7/10

Towards Robust LLM Post-Training: Automatic Failure Management for Reinforcement Fine-Tuning

Researchers introduce RFT-FaultBench, the first comprehensive benchmark for diagnosing failures in reinforcement fine-tuning of large language models, and propose RFT-FM, an automated framework for detecting, diagnosing, and remediating training failures. This addresses a critical gap in LLM post-training reliability where practitioners currently rely on manual inspection.

AI · Bearish · arXiv – CS AI · Apr 20 · 7/10

The Reasoning Trap: How Enhancing LLM Reasoning Amplifies Tool Hallucination

Researchers demonstrate that enhancing LLM reasoning capabilities through reinforcement learning paradoxically increases tool hallucination—where models incorrectly invoke non-existent or inappropriate tools. The study reveals a fundamental trade-off where stronger reasoning correlates with higher hallucination rates, suggesting current AI agent development approaches may inherently compromise reliability for capability.

🏢 OpenAI

AI · Neutral · arXiv – CS AI · Apr 15 · 7/10

Benchmarking Deflection and Hallucination in Large Vision-Language Models

Researchers introduce VLM-DeflectionBench, a new benchmark with 2,775 samples designed to evaluate how large vision-language models handle conflicting or insufficient evidence. The study reveals that most state-of-the-art LVLMs fail to appropriately deflect when faced with noisy or misleading information, highlighting critical gaps in model reliability for knowledge-intensive tasks.

AI · Bearish · arXiv – CS AI · Apr 14 · 7/10

Sanity Checks for Agentic Data Science

Researchers propose lightweight sanity checks for agentic data science (ADS) systems to detect falsely optimistic conclusions that users struggle to identify. Using the Predictability-Computability-Stability framework, the checks expose whether AI agents like OpenAI Codex reliably distinguish signal from noise. Testing on 11 real datasets reveals that over half produced unsupported affirmative conclusions despite individual runs suggesting otherwise.

🏢 OpenAI

AI · Bullish · arXiv – CS AI · Apr 7 · 7/10

Evolutionary Search for Automated Design of Uncertainty Quantification Methods

Researchers developed an LLM-powered evolutionary search method to automatically design uncertainty quantification systems for large language models, achieving up to 6.7% improvement in performance over manual designs. The study found that different AI models employ distinct evolutionary strategies, with some favoring complex linear estimators while others prefer simpler positional weighting approaches.

🧠 Claude · 🧠 Sonnet · 🧠 Opus

AI · Neutral · arXiv – CS AI · Mar 9 · 7/10

Agentic retrieval-augmented reasoning reshapes collective reliability under model variability in radiology question answering

Researchers evaluated 34 large language models on radiology questions, finding that agentic retrieval-augmented reasoning systems improve consensus and reliability across different AI models. The study shows these systems reduce decision variability between models and increase robust correctness, though 72% of incorrect outputs still carried moderate to high clinical severity.

AI · Neutral · arXiv – CS AI · Mar 5 · 7/10

Certainty robustness: Evaluating LLM stability under self-challenging prompts

Researchers introduce the Certainty Robustness Benchmark, a new evaluation framework that tests how large language models handle challenges to their responses in interactive settings. The study reveals significant differences in how AI models balance confidence and adaptability when faced with prompts like "Are you sure?" or "You are wrong!", identifying a critical new dimension for AI evaluation.

AI · Neutral · arXiv – CS AI · Mar 4 · 7/10

Know When to Abstain: Optimal Selective Classification with Likelihood Ratios

Researchers developed new selective classification methods using likelihood ratio tests based on the Neyman-Pearson lemma, allowing AI models to abstain from uncertain predictions. The approach shows superior performance across vision and language tasks, particularly under covariate shift scenarios where test data differs from training data.
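As a rough illustration of abstaining via a likelihood ratio, the sketch below fits one Gaussian to confidence scores of correct validation predictions and another to scores of incorrect ones, then accepts a prediction only when the log-likelihood ratio clears a threshold. The Gaussian density model, the function names, and the toy scores are assumptions for illustration; the paper's actual estimators are not described in this summary.

```python
import math
from statistics import mean, pstdev

def fit_gaussian(scores):
    """Return (mean, std) of a 1-D Gaussian fitted to the scores."""
    return mean(scores), max(pstdev(scores), 1e-6)  # floor std to avoid /0

def log_pdf(x, mu, sigma):
    """Log-density of N(mu, sigma^2) at x."""
    return -0.5 * math.log(2 * math.pi * sigma ** 2) - (x - mu) ** 2 / (2 * sigma ** 2)

def log_likelihood_ratio(score, correct_params, wrong_params):
    """log p(score | correct) - log p(score | wrong)."""
    return log_pdf(score, *correct_params) - log_pdf(score, *wrong_params)

def predict_or_abstain(label, score, correct_params, wrong_params, threshold=0.0):
    """Return the label if the ratio test passes, otherwise None (abstain)."""
    llr = log_likelihood_ratio(score, correct_params, wrong_params)
    return label if llr >= threshold else None

# Hypothetical validation scores, split by whether the prediction was right.
correct_params = fit_gaussian([0.90, 0.80, 0.95, 0.85])
wrong_params = fit_gaussian([0.40, 0.50, 0.30, 0.60])

accepted = predict_or_abstain("cat", 0.90, correct_params, wrong_params)
rejected = predict_or_abstain("cat", 0.40, correct_params, wrong_params)
```

Raising `threshold` trades coverage for accuracy: fewer predictions are accepted, but those that remain are more likely to be correct under the fitted densities.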

AI · Neutral · arXiv – CS AI · Jun 23 · 6/10

Latent Confidence Alignment for LLM Self-Assessment

Researchers propose Latent Confidence Alignment Error (LCAE), a new framework for evaluating how well large language models assess their own reliability by accounting for item difficulty and model ability. Testing on 20 medical-domain models shows the approach improves self-assessment quality without degrading performance, revealing a correlation between model reliability and computational inference costs.

AI · Neutral · arXiv – CS AI · Jun 19 · 6/10

Toward Calibrated Mixture-of-Experts Under Distribution Shift

Researchers demonstrate that calibration—aligning model confidence with actual accuracy—behaves differently in mixture-of-experts (MoE) models depending on routing mechanisms. While expert-level calibration suffices for hard-routed models under distribution shift, soft-routed models require additional adversarial reweighting techniques to maintain both accuracy and calibration reliability.
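For readers unfamiliar with how the gap between confidence and accuracy is usually measured, the following sketch computes the standard binned expected calibration error (ECE). It is a generic metric, not the paper's MoE-specific method, and the function name and bin count are illustrative choices.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Binned ECE: sample-weighted mean of |accuracy - mean confidence| per bin.

    confidences: floats in [0, 1]; correct: booleans of the same length.
    """
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    n = len(confidences)
    if n == 0:
        return 0.0
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)  # conf == 1.0 goes in last bin
        bins[idx].append((conf, ok))
    ece = 0.0
    for items in bins:
        if not items:
            continue
        avg_conf = sum(c for c, _ in items) / len(items)
        accuracy = sum(1 for _, ok in items if ok) / len(items)
        ece += (len(items) / n) * abs(accuracy - avg_conf)
    return ece

# Overconfident model: claims 90% confidence but is right 3 times out of 4.
overconfident = expected_calibration_error(
    [0.9, 0.9, 0.9, 0.9], [True, True, True, False]
)
# Well-calibrated at 50%: right half the time.
calibrated = expected_calibration_error([0.5, 0.5], [True, False])
```

An ECE of zero means stated confidence matches observed accuracy in every bin; under distribution shift this number typically rises even when accuracy holds up.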

AI · Neutral · arXiv – CS AI · Jun 9 · 6/10

BEACON: Behavioral Entropy Aggregation for Cross-Model Hallucination Detection in Large Language Models

Researchers introduce BEACON, a black-box hallucination detection framework for large language models that achieves 81.23% accuracy by analyzing model outputs without requiring internal access. The method combines multiple uncertainty signals including semantic entropy and consistency checks, outperforming existing baselines and offering practical deployment options across commercial LLM APIs.
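One of the signals mentioned, entropy over sampled answers, can be sketched in a few lines. This is a simplified stand-in and not BEACON itself: it groups answers by exact match after normalization, whereas semantic entropy proper groups answers by meaning. The helper names are hypothetical.

```python
import math
from collections import Counter

def normalize(answer):
    """Lowercase and collapse whitespace; a crude proxy for semantic equivalence."""
    return " ".join(answer.lower().split())

def answer_entropy(samples):
    """Entropy (nats) of the empirical distribution over normalized answers.

    Assumption: samples are several responses drawn from the same model for
    the same prompt. High entropy means the model disagrees with itself.
    """
    counts = Counter(normalize(s) for s in samples)
    n = len(samples)
    return -sum((c / n) * math.log(c / n) for c in counts.values())

consistent = answer_entropy(["Paris", "paris ", "  PARIS"])
split = answer_entropy(["Paris", "Lyon"])
```

Because it needs only sampled outputs, a signal like this works against closed APIs, which is the black-box setting the summary describes.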

AI · Bearish · arXiv – CS AI · Jun 9 · 6/10

The Confidence Trap: Calibration Attacks for Graph Neural Networks

Researchers have developed a Unified Graph Calibration Attack (UGCA) framework that exploits vulnerabilities in Graph Neural Networks' confidence calibration through adversarial structural perturbations. The study reveals that GNNs with higher accuracy or trained on complex datasets are more susceptible to calibration attacks, which increase prediction uncertainty while maintaining classification accuracy.
