#debugging News & Analysis

11 articles tagged with #debugging. AI-curated summaries with sentiment analysis and key takeaways from 50+ sources.

AI · Neutral · arXiv – CS AI · Jun 23 · 6/10

TraceView: Interactive Visualization of Agentic Program Repair Trajectories

TraceView is an interactive visualization tool that helps developers understand and diagnose how LLM-based automated program repair agents work through their reasoning processes. By organizing agent trajectories into visual graphs with labeled components, the tool addresses a critical gap in debugging agent failures and improving repair outcomes.

AI · Neutral · arXiv – CS AI · Jun 9 · 6/10

REFLECT: Intervention-Supported Error Attribution for Silent Failures in LLM Agent Traces

REFLECT is a new method for identifying errors in long reasoning traces produced by LLM agents, particularly addressing the challenging "silent failure" problem where outputs appear plausible but are incorrect. The approach improves upon existing error-localization techniques by using controlled replay and contrastive evidence to refine error attribution, achieving higher accuracy across multiple benchmarks without requiring ground-truth answers.

AI · Neutral · arXiv – CS AI · Jun 9 · 6/10

Causal Agent Replay: Counterfactual Attribution for LLM-Agent Failures

Researchers present Causal Agent Replay (CAR), a new method for diagnosing why large language model agents fail by identifying which decision step caused a failure rather than just which action executed it. Using structural causal models and intervention-based analysis, CAR achieves significantly higher attribution accuracy than existing LLM-judge approaches and provides confidence-bounded explanations for agent failures.

AI · Neutral · arXiv – CS AI · Jun 2 · 6/10

FALAT: Tracing Failures in LLM Agent Trajectories via Dependency-Guided Search

Researchers introduce FALAT, a diagnostic framework that traces failures in LLM-based agent systems by analyzing dependencies across multi-step trajectories. The system identifies which agent caused a failure and which specific step introduced the decisive error, achieving 46% accuracy on algorithm-generated test cases.

AI · Neutral · arXiv – CS AI · May 1 · 6/10

Learning When to Remember: Risk-Sensitive Contextual Bandits for Abstention-Aware Memory Retrieval in LLM-Based Coding Agents

Researchers introduce RSCB-MC, a risk-sensitive contextual bandit system that improves how LLM-based coding agents decide whether to use external memory for debugging tasks. Rather than treating memory retrieval as a simple similarity-matching problem, the system treats it as a safety-critical control problem, achieving a 62.5% success rate with zero false positives in testing.

AI · Bullish · arXiv – CS AI · Mar 11 · 6/10

Towards a Neural Debugger for Python

Researchers have developed neural debuggers: AI models that emulate traditional Python debuggers by stepping through code execution, setting breakpoints, and predicting both forward and backward program states. This enables more interactive control over neural code interpretation than existing approaches, which only execute programs linearly.

🏢 Meta

AI · Bullish · arXiv – CS AI · Mar 9 · 6/10

XAI for Coding Agent Failures: Transforming Raw Execution Traces into Actionable Insights

Researchers developed an explainable AI (XAI) system that transforms raw execution traces from LLM-based coding agents into structured, human-interpretable explanations. The system enables users to identify failure root causes 2.8 times faster and propose fixes with 73% higher accuracy through a domain-specific failure taxonomy, automatic annotation, and hybrid explanation generation.

AI · Bullish · arXiv – CS AI · Mar 2 · 6/10 · 23

From Flat Logs to Causal Graphs: Hierarchical Failure Attribution for LLM-based Multi-Agent Systems

Researchers introduce CHIEF, a new framework that improves failure analysis in LLM-powered multi-agent systems by transforming execution logs into hierarchical causal graphs. The system uses oracle-guided backtracking and counterfactual attribution to better identify root causes of failures, outperforming existing methods on benchmark tests.

AI · Neutral · arXiv – CS AI · Mar 2 · 5/10 · 7

User Misconceptions of LLM-Based Conversational Programming Assistants

Researchers analyzed user misconceptions about LLM-based programming assistants such as ChatGPT, finding that users often have misplaced expectations about web access, code execution, and debugging capabilities. The study examined Python programming conversations from the WildChat dataset and identified a need for clearer communication of tool capabilities to prevent over-reliance and unproductive practices.

AI · Neutral · arXiv – CS AI · Feb 27 · 4/10 · 6

A Reversible Semantics for Janus

Researchers present a new reversible small-step semantics for Janus, a paradigmatic reversible programming language. The novel approach solves the problem of information loss during forward computation while maintaining equivalence to previous semantics.

AI · Neutral · Synced Review · Aug 14 · 4/10 · 8

Which Agent Causes Task Failures and When? Researchers from PSU and Duke Explore Automated Failure Attribution of LLM Multi-Agent Systems

Researchers from Penn State University and Duke University are exploring automated failure attribution in LLM Multi-Agent Systems to identify which agents cause task failures and when. The study addresses a common issue where multi-agent systems fail to complete tasks despite high activity levels, aiming to improve system reliability and debugging.