#failure-analysis News & Analysis

7 articles tagged with #failure-analysis. AI-curated summaries with sentiment analysis and key takeaways from 50+ sources.

AI · Neutral · arXiv – CS AI · Apr 15 · 7/10

The Long-Horizon Task Mirage? Diagnosing Where and Why Agentic Systems Break

Researchers introduce HORIZON, a diagnostic benchmark for identifying and analyzing why large language model agents fail at long-horizon tasks that require extended action sequences. By evaluating state-of-the-art models across multiple domains and proposing an LLM-as-a-Judge attribution pipeline, the study provides a systematic methodology for understanding agent limitations and improving reliability.

🧠 GPT-5 · 🧠 Claude

AI · Neutral · arXiv – CS AI · 4d ago · 6/10

Drive-P2D: A Progressive Perception-to-Decision Benchmark for VLMs in Autonomous Driving

Researchers introduce Drive-P2D, a comprehensive benchmark for evaluating vision-language models in autonomous driving that tests perception and decision-making across progressive complexity levels. The benchmark addresses gaps in existing evaluation methods by separating reasoning analysis from objective answer scoring and by identifying specific failure modes, which could help improve VLM safety for real-world deployment.

AI · Bullish · arXiv – CS AI · Apr 10 · 6/10

KITE: Keyframe-Indexed Tokenized Evidence for VLM-Based Robot Failure Analysis

KITE is a training-free system that converts long robot execution videos into compact, interpretable tokens for vision-language models to analyze robot failures. The approach combines keyframe extraction, open-vocabulary detection, and bird's-eye-view spatial representations to enable failure detection, identification, localization, and correction without requiring model fine-tuning.

AI · Neutral · arXiv – CS AI · Mar 9 · 6/10

Probing Visual Concepts in Lightweight Vision-Language Models for Automated Driving

Researchers analyzed Vision-Language Models (VLMs) used in automated driving to understand why they fail on simple visual tasks. They identified two failure modes: perceptual failure, where visual information is not encoded, and cognitive failure, where the information is present but not properly aligned with language semantics.

AI · Bullish · arXiv – CS AI · Mar 2 · 6/10 · 23

From Flat Logs to Causal Graphs: Hierarchical Failure Attribution for LLM-based Multi-Agent Systems

Researchers introduce CHIEF, a new framework that improves failure analysis in LLM-powered multi-agent systems by transforming execution logs into hierarchical causal graphs. The system uses oracle-guided backtracking and counterfactual attribution to better identify root causes of failures, outperforming existing methods on benchmark tests.

AI · Neutral · arXiv – CS AI · Mar 2 · 7/10 · 14

Demystifying the Lifecycle of Failures in Platform-Orchestrated Agentic Workflows

Researchers present AgentFail, a dataset of 307 real-world failure cases from agentic workflow platforms, and analyze how multi-agent AI systems fail and how they can be repaired. The study finds that failures in these low-code orchestrated AI workflows propagate differently than in traditional software, making them harder to diagnose and fix.

AI · Bullish · Synced Review · Jun 16 · 6/10 · 7

Researchers from PSU and Duke introduce “Multi-Agent Systems Automated Failure Attribution”

Researchers from Pennsylvania State University and Duke University have introduced automated failure attribution for multi-agent systems, a methodology that transforms the complex process of identifying system failures and their causes into a quantifiable and analyzable problem. This development could significantly improve the debugging and accountability processes in multi-agent AI system development.