#agent-reliability News & Analysis

22 articles tagged with #agent-reliability. AI-curated summaries with sentiment analysis and key takeaways from 50+ sources.

AI · Neutral · arXiv – CS AI · Jun 23 · 7/10

When Web Agents Finish but Still Fail: Reproducible Triggers and Trace Diagnostics for Parallel Web Exploration

Researchers introduce Parallel WebBench, a benchmark revealing critical failure modes in long-horizon web agents that produce confident but incomplete answers. Despite significant improvements in completion rates using GRPO training on synthetic data, agents still struggle with evidence grounding and synthesis accuracy, exposing gaps between appearing successful and actually solving tasks correctly.

🧠 GPT-4

AI · Bullish · arXiv – CS AI · Jun 12 · 7/10

Evoflux: Inference-Time Evolution of Executable Tool Workflows for Compact Agents

Researchers introduce Evoflux, an inference-time evolutionary search method that significantly improves how compact language models handle tool use and workflow execution. By treating tool failures as a repair problem rather than a generation problem, Evoflux increases execution feasibility from 3% to 17-24% on complex multi-tool tasks, outperforming traditional fine-tuning approaches while maintaining cost efficiency.

AI · Bullish · arXiv – CS AI · Jun 10 · 7/10

STAGE-Claw: Automated State-based Agent Benchmarking for Realistic Scenarios

Researchers introduce STAGE-Claw, an automated framework for evaluating AI agents in realistic personal-computing environments by measuring actual system state changes rather than textual responses. The framework creates 40 benchmark tasks and evaluates 11 frontier models, addressing critical gaps in how large language model agents are currently assessed.

AI · Bullish · arXiv – CS AI · Jun 10 · 7/10

ASA: Backbone-Training-Free Representation Engineering for Tool-Calling Agents

Researchers introduce Activation Steering Adapter (ASA), a method that requires no backbone training and improves LLM tool-calling reliability by intervening on mid-layer activations at inference time. The approach achieves significant performance gains on tool-use benchmarks, addressing a gap between what models internally represent and how they actually behave.

AI · Bullish · arXiv – CS AI · Jun 9 · 7/10

SKILL.nb: Selective Formalization and Gated Execution for Durable Agent Workflows

SKILL.nb is a new framework that improves AI agent reliability by selectively formalizing workflow steps based on execution evidence, storing them as versioned notebooks with natural language guidance and executable code. The system achieved 53.7% success on web automation tasks and retained 91.7% performance across multiple re-executions, significantly outperforming existing baselines in handling environment drift and task specification changes.

AI · Bullish · arXiv – CS AI · Jun 5 · 7/10

Policy-Conditioned Counterfactual Credit for Verifiable Reinforcement Learning of Long-Horizon Language Agents

Researchers present CVT-RL, a reinforcement learning algorithm that addresses the problem of long-horizon language agents learning shortcuts and unsupported reasoning chains by introducing policy-conditioned counterfactual credit estimation and intervention-validity gating. The method achieves 78.9% task success and reduces measured hacking attempts from 7.2% to 3.9%, demonstrating measurable improvements in agent reliability and verifiability.

AI · Neutral · arXiv – CS AI · May 29 · 7/10

OpenClawBench: Benchmarking Process-side Anomalies in Real-world Agent Execution Trajectories

Researchers introduce OpenClawBench, a large-scale dataset of 31,264 annotated agent execution trajectories that reveals a significant gap between task success and process reliability. The study finds that 9.3% of oracle-passing executions contain process-side anomalies like unresolved ambiguities and unsafe operations, demonstrating that success metrics alone mask critical failure modes in AI agent systems.

AI · Bullish · arXiv – CS AI · May 29 · 7/10

GRASP: Gated Regression-Aware Skill Proposer for Self-Improving LLM Agents

Researchers introduce GRASP, a method for improving large language model agents through controlled skill library updates that prevent performance regression. Tested across five base models on clinical benchmarks, GRASP achieves dramatic improvements (40.6% to 88.8% on MedAgentBench) while maintaining stability, outperforming existing self-improvement approaches by significant margins.

🧠 GPT-4 · 🧠 GPT-5 · 🧠 Gemini

AI · Bearish · arXiv – CS AI · May 29 · 7/10

Honest Lying: Understanding Memory Confabulation in Reflexive Agents

Researchers discovered that reflexive AI agents systematically store confident but false interpretations of tasks in their memory, a phenomenon called memory confabulation, causing them to repeat incorrect behaviors even when environments reset. The study introduces a metric to detect this failure mode and proposes programmatic solutions that significantly improve agent performance and reduce reliance on false reflective content.

AI · Bullish · arXiv – CS AI · May 28 · 7/10

ReflexGrad: Within-Episode Failure Recovery in LLM Agents via Progress-Gated Dual-Process Routing

ReflexGrad introduces a dual-process architecture that lets LLM agents recover from failures within a single episode without requiring demonstrations. The system combines fast continuous refinement with slow causal diagnosis and improves performance on benchmark tasks, with smaller models matching the performance of larger ones through architectural changes rather than scale.

🧠 GPT-5

AI · Neutral · arXiv – CS AI · Apr 15 · 7/10

The Long-Horizon Task Mirage? Diagnosing Where and Why Agentic Systems Break

Researchers introduce HORIZON, a diagnostic benchmark for identifying and analyzing why large language model agents fail at long-horizon tasks requiring extended action sequences. By evaluating state-of-the-art models across multiple domains and proposing an LLM-as-a-Judge attribution pipeline, the study provides systematic methodology for understanding agent limitations and improving reliability.

🧠 GPT-5 · 🧠 Claude

AI · Bullish · arXiv – CS AI · Apr 14 · 7/10

Disambiguation-Centric Finetuning Makes Enterprise Tool-Calling LLMs More Realistic and Less Risky

Researchers introduce DiaFORGE, a three-stage framework for training LLMs to reliably invoke enterprise APIs by focusing on disambiguation between similar tools and underspecified arguments. Fine-tuned models achieved 27-49 percentage points higher tool-invocation success than GPT-4o and Claude-3.5-Sonnet, with an open corpus of 5,000 production-grade API specifications released for further research.

🧠 GPT-4 · 🧠 Claude

AI · Bullish · arXiv – CS AI · Mar 5 · 6/10

A Dual-Helix Governance Approach Towards Reliable Agentic AI for WebGIS Development

Researchers propose a dual-helix governance framework to address AI agent reliability issues in WebGIS development, implementing a 3-track architecture that achieved a 51% reduction in code complexity. The framework uses knowledge graphs and self-learning cycles to overcome LLM limitations such as context constraints and instruction failures.

AI · Neutral · arXiv – CS AI · Jun 23 · 6/10

When Agents Commit Too Soon: Diagnosing Premature Commitment in LLM Agents

Researchers identify 'premature commitment' as a hidden failure mode in LLM agents where models settle on an initial interpretation and defend it rather than adapting to new evidence. Using hidden-state analysis, they develop diagnostics that detect trajectory inconsistency with up to 97% accuracy and demonstrate that commitment is orthogonal to correctness—agents can be confidently wrong or right.

🧠 Llama

AI · Neutral · arXiv – CS AI · Jun 9 · 6/10

Web Agents Should Use Typed Actions Instead of Click-Based Browsing

A research paper proposes replacing click-based web automation with typed actions backed by semantic APIs, arguing this shift would make AI agents more reliable, auditable, and cost-effective. The authors introduce 'web verbs' as a standardized interface for web operations that could improve agent behavior and enable trustworthy automation at scale.

AI · Neutral · arXiv – CS AI · May 28 · 6/10

OR-Space: A Full-Lifecycle Workspace Benchmark for Industrial Optimization Agents

Researchers introduce OR-Space, a comprehensive benchmark for evaluating large language model agents in industrial operations research workflows. Unlike existing benchmarks that focus on single-stage problem translation, OR-Space tests agents across persistent multi-artifact workspaces with three task modes—building optimization models, revising them under changing requirements, and explaining solutions—to assess real-world reliability and practical readiness.

AI · Neutral · arXiv – CS AI · May 28 · 6/10

Do Agents Know What They Can't Do? Evaluating Feasibility Awareness in Tool-Using Agents

Researchers propose FeasiGen, a framework for automatically generating infeasible task benchmarks to evaluate whether AI agents recognize when tasks cannot be completed with available tools. Testing across nine models reveals critical weaknesses, with agents continuing execution on impossible tasks up to 73.9% of the time, though multi-agent architectures show improved performance.

AI · Bullish · arXiv – CS AI · May 27 · 6/10

Learning to Act under Noise: Enhancing Agent Robustness via Noisy Environments

Researchers introduce NoisyAgent, a training framework that improves large language model agent robustness by deliberately exposing them to environmental imperfections during training. By simulating real-world interaction noise—including user ambiguity and tool failures—the approach bridges the gap between idealized benchmark performance and practical deployment reliability.

AI · Neutral · arXiv – CS AI · May 12 · 6/10

Done, But Not Sure: Disentangling World Completion from Self-Termination in Embodied Agents

Researchers introduce VIGIL, an evaluation framework that separately measures whether embodied AI agents correctly complete tasks and properly report success, rather than conflating execution failures with commitment failures. Testing across 20 models reveals significant performance gaps in terminal commitment despite similar task execution, highlighting a critical blind spot in current AI agent benchmarking.

AI · Neutral · arXiv – CS AI · May 9 · 6/10

What Happens Inside Agent Memory? Circuit Analysis from Emergence to Diagnosis

Researchers analyzed internal mechanisms of LLM-based agent memory systems across the Qwen model family, discovering that routing circuits activate before content extraction circuits—a critical gap in small models. They developed an unsupervised diagnostic tool achieving 76.2% accuracy in identifying where silent memory failures occur, providing practical insights for improving agent reliability.

AI · Neutral · arXiv – CS AI · Apr 14 · 6/10

ClawVM: Harness-Managed Virtual Memory for Stateful Tool-Using LLM Agents

ClawVM is a virtual memory management system designed for stateful LLM agents that addresses critical failures in current context window management. The system implements typed pages, multi-resolution representations, and validated writeback protocols to ensure deterministic state residency and durability, adding minimal computational overhead.

AI · Bullish · arXiv – CS AI · Mar 2 · 7/10 · 11

Foundation World Models for Agents that Learn, Verify, and Adapt Reliably Beyond Static Environments

Researchers propose a new framework for foundation world models that enables autonomous agents to learn, verify, and adapt reliably in dynamic environments. The approach combines reinforcement learning with formal verification and adaptive abstraction to create agents that can synthesize verifiable programs and maintain correctness while adapting to novel conditions.