#ai-reliability News & Analysis

255 articles tagged with #ai-reliability. AI-curated summaries with sentiment analysis and key takeaways from 50+ sources.

AI · Bearish · arXiv – CS AI · May 27 · 7/10

Detecting Is Not Resolving: The Monitoring Control Gap in Retrieval Augmented LLMs

Researchers discovered that retrieval-augmented language models exhibit a critical safety gap: they can detect contradictory information in accumulated evidence but fail to incorporate this awareness into their final recommendations. Testing across model families showed single-turn safety evaluations significantly overestimate real-world robustness in multi-turn scenarios where evidence accumulates.

AI · Neutral · arXiv – CS AI · May 27 · 7/10

Why LLMs Hallucinate on Structured Knowledge: A Mechanistic Analysis of Reasoning over Linearized Representations

Researchers have identified the mechanistic causes of hallucinations in large language models when reasoning over structured knowledge like graphs and tables. The study reveals that hallucinations stem from systematic failures in attention allocation and semantic grounding in feed-forward layers, rather than random errors, with findings applicable across multiple structured knowledge formats.

AI · Bearish · arXiv – CS AI · May 27 · 7/10

Evaluating the Relevance of Uncertainty Estimators for LLM Hallucination

Researchers challenge the assumption that uncertainty estimation methods can reliably detect LLM hallucinations, finding highly variable and often weak associations across different hallucination types. The study evaluates multiple uncertainty quantification approaches against intrinsic and extrinsic hallucinations, revealing that uncertainty signals may not consistently indicate model failures.

AI · Bullish · arXiv – CS AI · May 27 · 7/10

Mind the Tool Failures: Achieving Synergistic Tool Gains for Medical Agents

Researchers propose a reinforcement learning framework that enables medical AI agents to achieve synergistic tool use by selecting appropriate diagnostic and treatment tools on a per-instance basis rather than relying on single fixed tools. The approach addresses the critical challenge that individual medical tools frequently fail on difficult cases, which conventional task-level selection cannot overcome, potentially improving safety and reliability in clinical AI systems.

AI · Bearish · Decrypt – AI · May 25 · 7/10

Famed iPhone, Sony Hacker Says AI Coding Agents Are a Disaster Waiting to Happen

George Hotz, the renowned iPhone and Sony hacker, has publicly warned that AI coding agents pose serious risks after testing them on real projects for six months. He contends that these agents are generating undetectable low-quality code at scale, creating problems that large organizations may not discover until significant damage has occurred.

AI · Bullish · arXiv – CS AI · May 12 · 7/10

Debugging the Debuggers: Failure-Anchored Structured Recovery for Software Engineering Agents

Researchers present PROBE, a framework that improves how AI software engineering agents recover from failures by converting runtime telemetry into structured diagnoses and bounded recovery guidance. The system achieves 65% diagnosis accuracy and a 21.8% recovery rate on previously unresolved cases, and a prototype deployed at Microsoft shows practical viability without disrupting existing workflows.

AI · Neutral · arXiv – CS AI · May 12 · 7/10

AgentCollabBench: Diagnosing When Good Agents Make Bad Collaborators

Researchers introduced AgentCollabBench, a diagnostic benchmark revealing critical vulnerabilities in multi-agent AI systems where constraints silently fail during peer collaboration. The study demonstrates that communication topology—not model capability alone—determines whether safeguards survive information handoffs between agents, exposing structural weaknesses invisible to standard outcome-based evaluation.

GPT-4 · Gemini · Llama

AI · Bearish · arXiv – CS AI · May 12 · 7/10

When Agents Overtrust Environmental Evidence: An Extensible Agentic Framework for Benchmarking Evidence-Grounding Defects in LLM Agents

Researchers introduce EnvTrustBench, a benchmarking framework that identifies evidence-grounding defects (EGDs) in LLM agents—failures where agents act on stale, incorrect, or malicious environmental data without verification. Testing across 6 LLM backbones and 5 agent scaffolds reveals consistent vulnerabilities, exposing a critical reliability gap in agent systems that increasingly interact with real-world APIs, files, and logs.

AI · Bullish · arXiv – CS AI · May 12 · 7/10

When Language Overwrites Vision: Over-Alignment and Geometric Debiasing in Vision-Language Models

Researchers identify a fundamental geometric flaw in decoder-based Vision-Language Models where visual embeddings become over-aligned with linguistic patterns, causing systematic hallucinations. The study introduces quantitative methods to characterize this bias and proposes training-free and fine-tuning solutions that reduce hallucinations across multiple benchmarks without computational overhead.

AI · Neutral · arXiv – CS AI · May 12 · 7/10

Can Agent Benchmarks Support Their Scores? Evidence-Supported Bounds for Interactive-Agent Evaluation

Researchers propose an outcome evidence reporting layer to improve the reliability of interactive agent benchmarks by explicitly tracking which runs have sufficient evidence of success versus uncertain cases. The framework evaluates five major AI benchmarks and reveals that surface-level outcome checks often fail to verify whether agents actually achieved intended results, making reported scores potentially misleading.

AI · Bearish · arXiv – CS AI · May 11 · 7/10

LLM hallucinations in the wild: Large-scale evidence from non-existent citations

Researchers auditing 2.5 million scientific papers found 146,932 hallucinated citations in 2025 alone, with non-existent references surging sharply after LLM adoption. The errors concentrate in AI-heavy fields and papers with linguistic signatures of AI assistance, while current journal moderation fails to catch most instances, threatening scientific integrity and reinforcing existing biases in academic credit attribution.

AI · Neutral · arXiv – CS AI · May 11 · 7/10

Tracing Uncertainty in Language Model "Reasoning"

Researchers have developed a method to predict whether language model reasoning traces produce correct answers by analyzing uncertainty profiles—patterns in model confidence across generated token sequences. The approach achieves 80.7% accuracy in detecting errors and can identify failures within the first few hundred tokens, providing insights into how LLMs actually perform reasoning tasks.

AI · Bearish · arXiv – CS AI · May 9 · 7/10

Are Large Language Models Robust in Understanding Code Against Semantics-Preserving Mutations?

Researchers found that large language models frequently arrive at correct code predictions through flawed reasoning, with performance dropping by up to 70% when code undergoes semantics-preserving mutations. The study reveals substantial gaps between apparent accuracy and genuine semantic understanding, calling into question the reliability of LLMs for critical programming tasks.

AI · Neutral · arXiv – CS AI · May 9 · 7/10

On Semantic Loss Fine-Tuning Approach for Preventing Model Collapse in Causal Reasoning

Researchers demonstrate that standard fine-tuning of transformer models on causal reasoning tasks causes catastrophic collapse where models learn trivial solutions while appearing accurate. They propose a semantic loss function with graph-based constraints that prevents collapse and achieves stable, context-dependent causal reasoning with 42.7% improvement over baseline models.

AI · Neutral · arXiv – CS AI · May 9 · 7/10

Bridging Generation and Training: A Systematic Review of Quality Issues in LLMs for Code

A systematic review of 114 studies reveals that code quality defects in large language models stem primarily from training data imperfections rather than model limitations alone. The research establishes a taxonomy linking 18 propagation mechanisms between data quality issues and generated code failures, while advocating for proactive data governance over reactive post-generation filtering.

AI · Bullish · arXiv – CS AI · May 9 · 7/10

TACT: Mitigating Overthinking and Overacting in Coding Agents via Activation Steering

Researchers introduce TACT, a technique using activation steering to detect and correct 'agent drift' in language model coding agents, where models either repeatedly reason over known information or issue tool calls without proper reasoning. The method improves task resolution rates by 4.8-5.8 percentage points across multiple benchmarks while reducing steps needed to complete tasks by up to 26%.

AI · Bearish · arXiv – CS AI · May 7 · 7/10

Seeing the Goal, Missing the Truth: Human Accountability for AI Bias

Research shows that large language models exhibit measurable bias when their downstream purpose is revealed, even when generating supposedly task-independent metrics. This bias stems from human research design choices rather than algorithmic flaws, raising critical questions about how AI systems are deployed in finance and other sensitive domains.

AI · Bullish · arXiv – CS AI · May 7 · 7/10

Local Intrinsic Dimension Unveils Hallucinations in Diffusion Models

Researchers have identified local intrinsic dimension (LID) as the primary driver of hallucinations in diffusion models—the phenomenon where AI generates structurally impossible outputs like hands with extra fingers. They propose Intrinsic Quenching (IQ), a corrective mechanism that reduces these anomalies and shows particular promise for medical imaging applications.

AI · Bearish · arXiv – CS AI · May 4 · 7/10

Models Recall What They Violate: Constraint Adherence in Multi-Turn LLM Ideation

Researchers introduce DriftBench, a benchmark evaluating how well large language models maintain fidelity to original constraints during multi-turn iterative refinement. The study reveals a critical disconnect: models can accurately restate constraints while simultaneously violating them, with non-compliance rates ranging from 8% to 99% depending on the model.

AI · Bearish · Ars Technica – AI · May 1 · 7/10

Study: AI models that consider user's feeling are more likely to make errors

A new study reveals that AI models over-tuned to prioritize user satisfaction tend to make more factual errors. The finding highlights a trade-off in AI development between user experience and accuracy, with significant implications for deploying AI systems in high-stakes domains.

AI · Bearish · arXiv – CS AI · May 1 · 7/10

In-Context Examples Suppress Scientific Knowledge Recall in LLMs

Research shows that in-context examples in large language models suppress recall of scientific knowledge, causing models to shift from knowledge-driven reasoning to empirical pattern fitting even when examples are generated from the same formulas they should reinforce. This finding across 60 tasks and four models suggests practitioners deploying LLMs for scientific work should be cautious about using examples, as they may undermine rather than support domain expertise.

AI · Bullish · arXiv – CS AI · May 1 · 7/10

From Unstructured Recall to Schema-Grounded Memory: Reliable AI Memory via Iterative, Schema-Aware Extraction

Researchers propose a schema-grounded approach to AI memory that treats persistent storage as a system of record rather than a search problem, using iterative extraction with validation gates. The method achieves 97.10% F1 on memory benchmarks and 95.2% accuracy on application tasks, significantly outperforming retrieval-based baselines and suggesting that memory architecture matters more than model scale alone.

AI · Bearish · arXiv – CS AI · May 1 · 7/10

LLM Biases

Researchers identify four systematic bias channels in transformer-based AI recommenders: positional bias favoring recent events, popularity amplification creating echo chambers, latent driver bias from unobserved user motivations, and synthetic data bias from retraining on AI-generated logs. These mechanism-level risks can distort user exposure and choice at scale, potentially reducing reliability despite strong offline performance metrics.

AI · Bullish · arXiv – CS AI · May 1 · 7/10

OmniDrive-R1: Reinforcement-driven Interleaved Multi-modal Chain-of-Thought for Trustworthy Vision-Language Autonomous Driving

OmniDrive-R1 is a new Vision-Language Model framework that addresses critical reliability failures in autonomous driving by combining perception and reasoning through an interleaved multi-modal chain-of-thought mechanism, achieving significant accuracy improvements (37.81% to 73.62%) without requiring dense localization labels.

AI · Neutral · arXiv – CS AI · Apr 20 · 7/10

Why Fine-Tuning Encourages Hallucinations and How to Fix It

Researchers identify that supervised fine-tuning of large language models increases hallucinations by degrading pre-existing knowledge through semantic interference. The study proposes self-distillation and parameter freezing techniques to mitigate this problem while preserving task performance.
