#ai-reliability News & Analysis

154 articles tagged with #ai-reliability. AI-curated summaries with sentiment analysis and key takeaways from 50+ sources.

154 articles

AINeutralarXiv – CS AI · May 127/10

🧠

Can Agent Benchmarks Support Their Scores? Evidence-Supported Bounds for Interactive-Agent Evaluation

Researchers propose an outcome evidence reporting layer to improve the reliability of interactive agent benchmarks by explicitly tracking which runs have sufficient evidence of success versus uncertain cases. The framework evaluates five major AI benchmarks and reveals that surface-level outcome checks often fail to verify whether agents actually achieved intended results, making reported scores potentially misleading.

AIBearisharXiv – CS AI · May 127/10

🧠

When Agents Overtrust Environmental Evidence: An Extensible Agentic Framework for Benchmarking Evidence-Grounding Defects in LLM Agents

Researchers introduce EnvTrustBench, a benchmarking framework that identifies evidence-grounding defects (EGDs) in LLM agents—failures where agents act on stale, incorrect, or malicious environmental data without verification. Testing across 6 LLM backbones and 5 agent scaffolds reveals consistent vulnerabilities, exposing a critical reliability gap in agent systems that increasingly interact with real-world APIs, files, and logs.

AIBullisharXiv – CS AI · May 127/10

🧠

When Language Overwrites Vision: Over-Alignment and Geometric Debiasing in Vision-Language Models

Researchers identify a fundamental geometric flaw in decoder-based Vision-Language Models where visual embeddings become over-aligned with linguistic patterns, causing systematic hallucinations. The study introduces quantitative methods to characterize this bias and proposes training-free and fine-tuning solutions that reduce hallucinations across multiple benchmarks without computational overhead.

AIBullisharXiv – CS AI · May 127/10

🧠

Debugging the Debuggers: Failure-Anchored Structured Recovery for Software Engineering Agents

Researchers present PROBE, a framework that improves how AI software engineering agents recover from failures by converting runtime telemetry into structured diagnoses and bounded recovery guidance. The system achieves 65% diagnosis accuracy and 21.8% recovery rates on previously unresolved cases, with a prototype deployed at Microsoft showing practical viability without disrupting existing workflows.

AIBearisharXiv – CS AI · May 117/10

🧠

LLM hallucinations in the wild: Large-scale evidence from non-existent citations

Researchers auditing 2.5 million scientific papers found 146,932 hallucinated citations in 2025 alone, with non-existent references surging sharply after LLM adoption. The errors concentrate in AI-heavy fields and papers with linguistic signatures of AI assistance, while current journal moderation fails to catch most instances, threatening scientific integrity and reinforcing existing biases in academic credit attribution.

AINeutralarXiv – CS AI · May 117/10

🧠

Tracing Uncertainty in Language Model "Reasoning"

Researchers have developed a method to predict whether language model reasoning traces produce correct answers by analyzing uncertainty profiles—patterns in model confidence across generated token sequences. The approach achieves 80.7% accuracy in detecting errors and can identify failures within the first few hundred tokens, providing insights into how LLMs actually perform reasoning tasks.

AIBullisharXiv – CS AI · May 97/10

🧠

TACT: Mitigating Overthinking and Overacting in Coding Agents via Activation Steering

Researchers introduce TACT, a technique using activation steering to detect and correct 'agent drift' in language model coding agents, where models either repeatedly reason over known information or issue tool calls without proper reasoning. The method improves task resolution rates by 4.8-5.8 percentage points across multiple benchmarks while reducing steps needed to complete tasks by up to 26%.

AIBearisharXiv – CS AI · May 97/10

🧠

Are Large Language Models Robust in Understanding Code Against Semantics-Preserving Mutations?

Researchers found that large language models frequently arrive at correct code predictions through flawed reasoning, with performance dropping up to 70% when code undergoes semantics-preserving mutations. The study reveals substantial gaps between apparent accuracy and genuine semantic understanding, questioning the reliability of LLMs for critical programming tasks.

AINeutralarXiv – CS AI · May 97/10

🧠

Bridging Generation and Training: A Systematic Review of Quality Issues in LLMs for Code

A systematic review of 114 studies reveals that code quality defects in large language models stem primarily from training data imperfections rather than model limitations alone. The research establishes a taxonomy linking 18 propagation mechanisms between data quality issues and generated code failures, while advocating for proactive data governance over reactive post-generation filtering.

AINeutralarXiv – CS AI · May 97/10

🧠

On Semantic Loss Fine-Tuning Approach for Preventing Model Collapse in Causal Reasoning

Researchers demonstrate that standard fine-tuning of transformer models on causal reasoning tasks causes catastrophic collapse where models learn trivial solutions while appearing accurate. They propose a semantic loss function with graph-based constraints that prevents collapse and achieves stable, context-dependent causal reasoning with 42.7% improvement over baseline models.

AIBullisharXiv – CS AI · May 77/10

🧠

Local Intrinsic Dimension Unveils Hallucinations in Diffusion Models

Researchers have identified local intrinsic dimension (LID) as the primary driver of hallucinations in diffusion models—the phenomenon where AI generates structurally impossible outputs like hands with extra fingers. They propose Intrinsic Quenching (IQ), a corrective mechanism that reduces these anomalies and shows particular promise for medical imaging applications.

AIBearisharXiv – CS AI · May 77/10

🧠

Seeing the Goal, Missing the Truth: Human Accountability for AI Bias

Research shows that Large Language Models exhibit measurable bias when their downstream purpose is revealed, even when generating supposedly task-independent metrics. This bias stems from human research design choices rather than algorithmic flaws, raising critical questions about how AI systems are deployed in financial and other sensitive domains.

AIBearisharXiv – CS AI · May 47/10

🧠

Models Recall What They Violate: Constraint Adherence in Multi-Turn LLM Ideation

Researchers introduce DriftBench, a benchmark evaluating how well large language models maintain fidelity to original constraints during multi-turn iterative refinement. The study reveals a critical disconnect: models can accurately restate constraints while simultaneously violating them, with non-compliance rates ranging from 8% to 99% depending on the model.

AIBearishArs Technica – AI · May 17/10

🧠

Study: AI models that consider user's feeling are more likely to make errors

A new study reveals that AI models optimized to prioritize user satisfaction tend to make more factual errors by overtuning their responses. This finding highlights a critical trade-off in AI development between user experience and accuracy that has significant implications for deploying AI systems in high-stakes domains.

AIBearisharXiv – CS AI · May 17/10

🧠

In-Context Examples Suppress Scientific Knowledge Recall in LLMs

Research shows that in-context examples in large language models suppress recall of scientific knowledge, causing models to shift from knowledge-driven reasoning to empirical pattern fitting even when examples are generated from the same formulas they should reinforce. This finding across 60 tasks and four models suggests practitioners deploying LLMs for scientific work should be cautious about using examples, as they may undermine rather than support domain expertise.

AIBullisharXiv – CS AI · May 17/10

🧠

From Unstructured Recall to Schema-Grounded Memory: Reliable AI Memory via Iterative, Schema-Aware Extraction

Researchers propose a schema-grounded approach to AI memory that treats persistent storage as a system of record rather than a search problem, using iterative extraction with validation gates. The method achieves 97.10% F1 on memory benchmarks and 95.2% accuracy on application tasks, significantly outperforming retrieval-based baselines and suggesting that memory architecture matters more than model scale alone.

AIBullisharXiv – CS AI · May 17/10

🧠

OmniDrive-R1: Reinforcement-driven Interleaved Multi-modal Chain-of-Thought for Trustworthy Vision-Language Autonomous Driving

OmniDrive-R1 is a new Vision-Language Model framework that addresses critical reliability failures in autonomous driving by combining perception and reasoning through an interleaved multi-modal chain-of-thought mechanism, achieving significant accuracy improvements (37.81% to 73.62%) without requiring dense localization labels.

AIBearisharXiv – CS AI · May 17/10

🧠

LLM Biases

Researchers identify four systematic bias channels in transformer-based AI recommenders: positional bias favoring recent events, popularity amplification creating echo chambers, latent driver bias from unobserved user motivations, and synthetic data bias from retraining on AI-generated logs. These mechanism-level risks can distort user exposure and choice at scale, potentially reducing reliability despite strong offline performance metrics.

AIBullisharXiv – CS AI · Apr 207/10

🧠

AgentV-RL: Scaling Reward Modeling with Agentic Verifier

Researchers introduce AgentV-RL, an agentic verifier framework that enhances reward modeling for large language models by combining bidirectional reasoning agents with tool-use capabilities. The system addresses critical limitations in LLM verification by enabling forward and backward tracing of solutions, achieving 25.2% performance gains over existing methods and positioning agentic reward modeling as a promising new paradigm.

AINeutralarXiv – CS AI · Apr 207/10

🧠

Why Fine-Tuning Encourages Hallucinations and How to Fix It

Researchers identify that supervised fine-tuning of large language models increases hallucinations by degrading pre-existing knowledge through semantic interference. The study proposes self-distillation and parameter freezing techniques to mitigate this problem while preserving task performance.

AIBearisharXiv – CS AI · Apr 207/10

🧠

Persona-Assigned Large Language Models Exhibit Human-Like Motivated Reasoning

Researchers found that large language models assigned personas exhibit motivated reasoning similar to humans, with up to 9% reduced accuracy in detecting misinformation and political personas being 90% more likely to evaluate scientific evidence favorably when it aligns with their induced identity. Standard debiasing prompts prove ineffective at mitigating these biases, raising concerns about LLMs amplifying identity-driven reasoning.

AIBearisharXiv – CS AI · Apr 157/10

🧠

Mobile GUI Agents under Real-world Threats: Are We There Yet?

Researchers have identified critical vulnerabilities in mobile GUI agents powered by large language models, revealing that third-party content in real-world apps causes these agents to fail significantly more often than benchmark tests suggest. Testing on 122 dynamic tasks and over 3,000 static scenarios shows misleading rates of 36-42%, raising serious concerns about deploying these agents in commercial settings.

AIBearisharXiv – CS AI · Apr 157/10

🧠

One Token Away from Collapse: The Fragility of Instruction-Tuned Helpfulness

Researchers demonstrate that instruction-tuned large language models suffer severe performance degradation when subject to simple lexical constraints like banning a single punctuation mark or common word, losing 14-48% of response quality. This fragility stems from a planning failure where models couple task competence to narrow surface-form templates, affecting both open-weight and commercially deployed closed-weight models like GPT-4o-mini.

🧠 GPT-4

AINeutralarXiv – CS AI · Apr 157/10

🧠

Distorted or Fabricated? A Survey on Hallucination in Video LLMs

Researchers have conducted a comprehensive survey on hallucinations in Video Large Language Models (Vid-LLMs), identifying two core types—dynamic distortion and content fabrication—and their root causes in temporal representation limitations and insufficient visual grounding. The study reviews evaluation benchmarks, mitigation strategies, and proposes future directions including motion-aware encoders and counterfactual learning to improve reliability.

AIBearisharXiv – CS AI · Apr 157/10

🧠

Is Vibe Coding the Future? An Empirical Assessment of LLM Generated Codes for Construction Safety

Researchers empirically evaluated 450 LLM-generated Python scripts for construction safety and found alarming reliability gaps, including a 45% silent failure rate where code executes but produces mathematically incorrect safety outputs. The study demonstrates that current frontier LLMs lack the deterministic rigor required for autonomous safety-critical engineering applications, necessitating human oversight and governance frameworks.

🧠 GPT-4🧠 Claude🧠 Gemini

← PrevPage 2 of 7Next →