
LogDx-CI: Benchmarking Log Reduction Tools for LLM Root-Cause Diagnosis

arXiv – CS AI|Bowen Qin|May 29, 2026 at 04:00 AM

AI Summary

Researchers introduce LogDx-CI, a benchmark comparing 11 log-reduction tools for LLM-based debugging of CI failures. Hybrid grep+tail routers achieve the best cost-quality tradeoff, while agent-loop systems can recover from weak contexts through iterative tool calls, though at higher computational cost.

Analysis

LogDx-CI addresses a fundamental infrastructure problem in AI-assisted debugging: how to compress massive CI failure logs (up to 200k lines) without losing diagnostic signal. The benchmark tests 11 reduction strategies across 35 real GitHub Actions failures and evaluates them with multiple LLM families, giving production systems that must balance cost, latency, and accuracy a shared basis for comparison.

The research reveals an important architectural insight about the agentic paradigm. Single-shot LLM diagnosis varies widely with log reduction quality (a 0.42 spread in score), while agent systems narrow that spread to 0.059 through iterative refinement. This recovery has a hidden cost, however: weak initial contexts force agents to issue 2-4x more tool calls. Organizations therefore cannot deploy any reduction tool and rely on agent iteration to compensate; upstream efficiency directly affects downstream resource consumption.

Cross-family model pairing (GPT-5-mini summarizer with Claude Haiku debugger) outperforms same-family combinations, contradicting assumptions that models perform best when working with outputs from their own family. This finding has implications for AI stack design, suggesting diversity in model composition may be underutilized in production systems.

For practitioners building CI/CD debugging systems, the results indicate that sophisticated ML-based summarizers may not justify their 10x cost premium over simpler hybrid approaches. The research provides quantitative justification for tool selection decisions that typically remain empirical and team-specific. Broader adoption of benchmarking frameworks like this could standardize log reduction practices across the industry.
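To make the "simpler hybrid approach" concrete, here is a minimal sketch of what a grep+tail reducer could look like. This is not the paper's implementation: the error pattern, the context and tail sizes, and the routing rule (union of grep hits and tail when something matches, a longer tail otherwise) are all assumptions chosen for illustration.

```python
import re
from typing import List

# Hypothetical failure markers; the benchmark's actual patterns are not given here.
ERROR_PATTERN = re.compile(r"error|failed|exception|traceback|fatal", re.IGNORECASE)


def grep_with_context(lines: List[str], context: int = 2) -> List[int]:
    """Return sorted indices of matching lines plus `context` lines on each side."""
    keep = set()
    for i, line in enumerate(lines):
        if ERROR_PATTERN.search(line):
            start = max(0, i - context)
            stop = min(len(lines), i + context + 1)
            keep.update(range(start, stop))
    return sorted(keep)


def tail_indices(lines: List[str], n: int) -> List[int]:
    """Return the indices of the last `n` lines."""
    return list(range(max(0, len(lines) - n), len(lines)))


def reduce_log(log_text: str, context: int = 2, tail: int = 20) -> str:
    """Assumed routing rule: grep hits plus a short tail, or a longer tail if grep finds nothing."""
    lines = log_text.splitlines()
    matched = grep_with_context(lines, context)
    if matched:
        keep = sorted(set(matched) | set(tail_indices(lines, tail)))
    else:
        keep = tail_indices(lines, tail * 5)
    # Prefix 1-based line numbers so the debugger model can refer back to the full log.
    return "\n".join(f"{i + 1}: {lines[i]}" for i in keep)
```

A reducer like this runs in a single pass with no model calls, which is the kind of property that would keep per-case cost low; whether it preserves enough diagnostic signal is exactly what a benchmark such as LogDx-CI is meant to measure.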

Key Takeaways

→Hybrid grep+tail routers achieve the best cost-quality balance among the tools tested, matching complex summarizers at 4.5x token reduction and $0.03 per case
→Agent-loop systems narrow the quality spread about 7x compared to single-shot evaluation (0.42 to 0.059), but weak contexts require 2-4x more tool calls for recovery
→Cross-family model pairing (GPT-5-mini + Claude Haiku) beats same-family combinations by 0.071 points, suggesting composition diversity matters
→GPT-5-mini summarizer delivers the best agent-loop performance (0.749 score) at roughly 10x lower cost than Claude Haiku ($0.18 vs $1.75)
→All benchmark data, code, and reproducibility infrastructure are publicly available for peer validation
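
The rounded multipliers in these takeaways can be checked against the raw figures quoted in this summary. The short calculation below uses only those quoted numbers; the variable names are illustrative.

```python
# Score spread across reduction tools, as quoted in the analysis above.
single_shot_spread = 0.42
agent_loop_spread = 0.059
spread_ratio = single_shot_spread / agent_loop_spread

# Summarizer costs, as quoted in the takeaways.
gpt5_mini_cost = 0.18
claude_haiku_cost = 1.75
cost_ratio = claude_haiku_cost / gpt5_mini_cost

print(f"spread narrows by {spread_ratio:.1f}x")   # about 7.1x
print(f"cost differs by {cost_ratio:.1f}x")       # about 9.7x
```

Both results are consistent with the rounded "7x" and "10x" figures.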


