
Decision-Aware Memory Cards: Counterfactual-Inspired Context Selection and Compression for Tool-Using LLM Agents

arXiv – CS AI | Xinyu Guan, Qianyang Zhao, Yuming Deng
🤖 AI Summary

Researchers introduce CICL, a decision-aware context layer that improves how language model agents select and compress relevant information for tool use. By scoring evidence on action criticality and packing high-utility data as typed memory cards, the system raises file-retrieval hit rates from 58% to 78% on SWE-bench tasks.

Analysis

CICL addresses a fundamental limitation in tool-using LLM agents: they often fail to surface decisive evidence at decision time, even when the relevant information exists in the broader context. The system operates as a measurement and selection layer that converts raw evidence into a structured context graph and routes judgments through multiple judge models (Qwen, Opus, GPT variants) to identify which information actually drives better agent actions. This separates the decision signal from any single judge model, enabling reproducible comparison between frontier models and lightweight surrogates under a unified protocol.

The work responds to the broader challenge of context-window management in increasingly capable AI systems. As LLM agents tackle complex tasks such as software engineering, they must navigate code repositories and documentation that are far too large to fit in context. CICL scores each piece of evidence on four criteria (action shift, outcome uplift, necessity, and negative-transfer risk), then packs only the highest-utility units as typed memory cards under a hard budget constraint. This principled approach differs from naive retrieval reranking, which orders evidence by surface relevance alone.
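The score-then-pack step described above can be illustrated with a minimal sketch. Everything here is an assumption made for illustration, not the authors' method: the `EvidenceUnit` fields, the equal-weight `utility` formula, the greedy utility-per-token packing in `pack_cards`, and the toy numbers. The summary does not say how CICL combines the four criteria or how it enforces its budget.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class EvidenceUnit:
    """Hypothetical evidence record; field names follow the four criteria in the text."""
    unit_id: str
    card_type: str                 # hypothetical type label, e.g. "file", "test", "doc"
    tokens: int                    # context cost of including this unit
    action_shift: float            # assumed: how much the unit changes the chosen action
    outcome_uplift: float          # assumed: how much the unit improves the outcome
    necessity: float               # assumed: how much is lost when the unit is removed
    negative_transfer_risk: float  # assumed: how likely the unit is to hurt


def utility(u: EvidenceUnit) -> float:
    # Assumption: equal-weight sum of the three benefits minus the risk.
    return u.action_shift + u.outcome_uplift + u.necessity - u.negative_transfer_risk


def pack_cards(units: List[EvidenceUnit], budget_tokens: int) -> List[EvidenceUnit]:
    """Greedy selection by utility per token; total cost never exceeds the budget."""
    candidates = [u for u in units if u.tokens > 0 and utility(u) > 0]
    ranked = sorted(candidates, key=lambda u: utility(u) / u.tokens, reverse=True)
    chosen: List[EvidenceUnit] = []
    used = 0
    for u in ranked:
        if used + u.tokens <= budget_tokens:  # hard budget: skip anything that overflows
            chosen.append(u)
            used += u.tokens
    return chosen


# Toy data, invented for illustration only.
units = [
    EvidenceUnit("a", "file", 100, 0.9, 0.8, 1.0, 0.1),
    EvidenceUnit("b", "doc", 300, 0.4, 0.3, 0.2, 0.1),
    EvidenceUnit("c", "test", 150, 0.6, 0.5, 0.4, 0.0),
    EvidenceUnit("d", "doc", 50, 0.1, 0.0, 0.0, 0.5),
]
selected = pack_cards(units, budget_tokens=260)
print([u.unit_id for u in selected])
```

In this sketch a unit with non-positive utility is dropped even when it would fit, which is one simple way to act on negative-transfer risk. A greedy pass is not an optimal knapsack solution; it is used here only because it keeps the budget constraint easy to see.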

Empirical results show concrete improvements: on 50 SWE-bench file-retrieval instances, Qwen-based reranking increased mean reciprocal rank from 0.634 to 0.790. Controlled diagnostics reveal action-criticality patterns. Removing the top semantic unit collapses F1 to zero, supporting the claim that the system identifies genuinely decision-critical evidence rather than statistically correlated noise.
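For readers unfamiliar with the retrieval metrics quoted above, the following sketch shows how mean reciprocal rank and a hit rate are commonly computed. The function names, the cutoff `k`, and the toy rankings are assumptions for illustration. The summary does not state which cutoff the reported hit rates use, and none of the data below comes from the paper.

```python
from typing import List, Set, Tuple

Run = Tuple[List[str], Set[str]]  # (ranked candidate files, gold files)


def reciprocal_rank(ranked: List[str], gold: Set[str]) -> float:
    """1 / rank of the first gold item, or 0.0 if no gold item is retrieved."""
    for position, item in enumerate(ranked, start=1):
        if item in gold:
            return 1.0 / position
    return 0.0


def mean_reciprocal_rank(runs: List[Run]) -> float:
    return sum(reciprocal_rank(r, g) for r, g in runs) / len(runs)


def hit_rate_at_k(runs: List[Run], k: int) -> float:
    """Fraction of runs with at least one gold item in the top k."""
    hits = sum(1 for ranked, gold in runs if gold & set(ranked[:k]))
    return hits / len(runs)


# Toy rankings, invented for illustration only.
runs: List[Run] = [
    (["a.py", "b.py", "c.py"], {"b.py"}),  # first gold item at rank 2
    (["x.py", "y.py"], {"x.py"}),          # first gold item at rank 1
    (["m.py", "n.py"], {"z.py"}),          # gold item never retrieved
]
print(mean_reciprocal_rank(runs), hit_rate_at_k(runs, 1))
```

Mean reciprocal rank rewards placing the first correct file near the top of the list, while the hit rate only checks whether a correct file appears within the cutoff, so a reranker can move one metric more than the other.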

The authors explicitly position this as a measurement and selection contribution rather than claiming end-to-end agent repair. This intellectual honesty strengthens the work's credibility. The system contributes to ongoing research on making language models more reliable at complex multi-step reasoning tasks where information retrieval meets decision-making.

Key Takeaways
  • CICL improves file retrieval hit rates from 58% to 78% by scoring evidence based on decision criticality rather than surface relevance alone
  • The system uses a shared eight-field schema to enable reproducible comparison between frontier models and lightweight rankers on decision-aware context selection
  • Ablation studies show that action-critical evidence drives performance: removing top-utility units collapses F1 to zero, supporting the scoring mechanism
  • The approach separates decision signals from judge models, allowing frontier annotation, local surrogates, and compact rankers to be evaluated under one auditable protocol
  • Authors acknowledge limitations and position this as a measurement layer for context selection, not a complete solution to LLM agent reliability