#agent-systems News & Analysis

44 articles tagged with #agent-systems. AI-curated summaries with sentiment analysis and key takeaways from 50+ sources.


AI · Bullish · arXiv – CS AI · Jun 23 · 7/10

AIR: Improving Agent Safety through Incident Response

Researchers introduce AIR, the first incident response framework for LLM agent systems that detects, contains, and recovers from failures autonomously. The framework achieves over 90% success rates across detection, remediation, and eradication, addressing a critical gap in agent safety by shifting focus from prevention-only approaches to active incident management.

AI · Bullish · arXiv – CS AI · Jun 11 · 7/10

MoCA-Agent: A Market-of-Claims Code Agent for Financial and Numerical Reasoning

Researchers introduced MoCA-Agent, a novel AI system that improves financial and numerical reasoning by decomposing questions into atomic claims verified through a market-based mechanism rather than free-form debate. The system achieved strong performance across ten benchmarks, including 78.3% on FinQA and 86.9% on ESGenius, demonstrating that claim-level verification enhances accuracy in high-stakes numerical reasoning tasks.

AI × Crypto · Neutral · arXiv – CS AI · Jun 9 · 7/10

Beyond Agent Architecture: Execution Assumptions and Reproducibility in LLM-Based Trading Systems

A new arXiv paper audits 30 LLM-based trading studies and finds that while agent architectures are well documented, evaluation methodologies (execution timing, transaction costs, and data splits) lack standardization, making performance claims difficult to compare or reproduce. The authors argue that LLM trading research needs clearer reporting standards for execution realism before architectural improvements can be meaningfully assessed.

AI · Bearish · arXiv – CS AI · Jun 9 · 7/10

POISE: Position-Aware Undetectable Skill Injection on LLM Agents

Researchers introduce POISE, a novel skill-poisoning attack against LLM agents that achieves 89.3% success by embedding malicious triggers into skill instructions in ways that evade both automated detection and human inspection. The attack exploits the reliability-stealth trade-off in existing injection methods, demonstrating that current security defenses struggle to distinguish poisoned skills from legitimate ones due to high false-positive rates.

🧠 GPT-5

AI · Bearish · arXiv – CS AI · Jun 9 · 7/10

Context-Fractured Decomposition Attacks on Tool-Using LLM Agents: Exploiting Artifact Provenance Gaps

Researchers demonstrate Context-Fractured Decomposition (CFD), a new class of jailbreak attacks against tool-using LLM agents that exploit gaps in artifact provenance tracking across multiple steps and system boundaries. By decomposing harmful requests across time and contexts while maintaining benign-looking intermediate artifacts, CFD achieves up to 28.3% higher success rates than existing attack methods, revealing fundamental vulnerabilities in how AI agents enforce safety guardrails in fragmented deployment environments.

AI · Bullish · arXiv – CS AI · Jun 9 · 7/10

Experience Makes Skillful: Enabling Generalizable Medical Agent Reasoning via Self-Evolving Skill Memory

Researchers introduce SkeMex, a self-evolving skill-based memory framework that enables medical AI agents to improve after deployment without retraining model weights. The system distills clinical interaction trajectories into reusable procedural skills, organized across multiple memory branches, and uses environment feedback to determine which experiences are genuinely useful for future decision-making.

AI · Bullish · arXiv – CS AI · Jun 4 · 7/10

Streaming Communication in Multi-Agent Reasoning

Researchers introduce StreamMA, a multi-agent reasoning system that streams intermediate reasoning steps between agents in real time rather than waiting for complete chains, reducing latency while improving accuracy. Testing across mathematics, science, and code benchmarks shows performance gains averaging 7.3 percentage points, with theoretical analysis demonstrating that early reasoning steps are more reliable than later ones.

🧠 GPT-5 · 🧠 Claude · 🧠 Opus

AI · Bearish · arXiv – CS AI · Jun 2 · 7/10

Benchmarking Security Risk Detection and Verification in Open Agentic Skill Ecosystems

Researchers introduce SkillVetBench, a security benchmark for detecting malicious skills in open agent platforms, addressing supply-chain risks in extensible AI ecosystems. The framework combines semantic analysis of skill specifications with runtime execution monitoring in sandboxes, revealing that static-only defenses miss up to 89% of threats hidden in natural-language instructions and multi-component logic.

AI · Bearish · arXiv – CS AI · May 29 · 7/10

Hijacking Agent Memory: Stealthy Trojan Attacks Through Conversational Interaction

Researchers present MemPoison, a novel attack that exploits vulnerabilities in large language model agents by injecting malicious information into their long-term memory through dialogue interactions. The attack achieves up to 95% success rates by using semantic bridges, entity masquerading, and embedding optimization to bypass modern selective memory mechanisms, revealing critical security gaps in autonomous AI systems.

AI · Bullish · arXiv – CS AI · May 27 · 7/10

Tool-Schema Compression Enables Agentic RAG Under Constrained Context Budgets

Researchers demonstrate that tool-schema compression reduces token consumption by 44-50%, enabling large language model agents to function under tight context constraints. Testing across 14 models shows compressed schemas restore RAG functionality with +20.5 percentage point exact-match improvements at 8K tokens, while frontier models can now handle 800+ tools instead of ~494.

AI · Bearish · arXiv – CS AI · May 12 · 7/10

Oracle Poisoning: Corrupting Knowledge Graphs to Weaponise AI Agent Reasoning

Researchers demonstrate 'Oracle Poisoning,' a novel attack where adversaries corrupt knowledge graphs used by AI agents, causing them to reach incorrect conclusions through valid reasoning. Testing across nine models from three providers shows that all models accept fabricated data at a 100% rate under moderate attack sophistication, revealing a critical vulnerability in production-scale agentic systems that differs fundamentally from prompt injection attacks.

🧠 GPT-5

AI · Neutral · arXiv – CS AI · May 9 · 7/10

SkillRet: A Large-Scale Benchmark for Skill Retrieval in LLM Agents

Researchers introduce SkillRet, a large-scale benchmark dataset containing 17,810 public agent skills designed to evaluate how language model agents retrieve appropriate tools from massive skill libraries. The benchmark demonstrates that current retrieval methods struggle significantly with realistic large-scale deployments, though task-specific fine-tuning improves performance by up to 16.9 points on key metrics.

AI · Bullish · arXiv – CS AI · May 9 · 7/10

From History to State: Constant-Context Skill Learning for LLM Agents

Researchers propose constant-context skill learning, a framework enabling LLM agents to learn reusable task procedures as lightweight modules rather than storing long prompts in memory. The approach reduces token usage per inference by 2-7x while maintaining or improving performance across multiple benchmark environments, addressing the privacy-capability tradeoff in agent deployment.

🧠 Llama

AI · Bullish · arXiv – CS AI · May 4 · 7/10

A11y-Compressor: A Framework for Enhancing the Efficiency of GUI Agent Observations through Visual Context Reconstruction and Redundancy Reduction

Researchers introduce A11y-Compressor, a framework that optimizes how AI agents interpret graphical user interfaces by transforming accessibility trees into more efficient representations. The approach reduces input tokens by 78% while simultaneously improving task success rates by 5.1 percentage points, addressing a critical bottleneck in GUI automation systems.

AI · Bullish · arXiv – CS AI · Apr 14 · 7/10

Three Roles, One Model: Role Orchestration at Inference Time to Close the Performance Gap Between Small and Large Agents

Researchers demonstrate that inference-time scaffolding can double the performance of small 8B language models on complex tool-use tasks without additional training, by deploying the same frozen model in three specialized roles: summarization, reasoning, and code correction. On a single 24GB GPU, this approach enables an 8B model to match or exceed much larger systems like DeepSeek-Coder 33B, suggesting efficient deployment paths for capable AI agents on modest hardware.

AI · Bullish · arXiv – CS AI · Apr 10 · 7/10

AgentOpt v0.1 Technical Report: Client-Side Optimization for LLM-Based Agent

AgentOpt v0.1, a new Python framework, addresses client-side optimization for AI agents by intelligently allocating models, tools, and API budgets across pipeline stages. Using search algorithms like Arm Elimination and Bayesian Optimization, the tool reduces evaluation costs by 24-67% while achieving near-optimal accuracy, with cost differences between model combinations reaching up to 32x at matched performance levels.

AI · Bearish · arXiv – CS AI · Apr 10 · 7/10

SkillTrojan: Backdoor Attacks on Skill-Based Agent Systems

Researchers have identified SkillTrojan, a novel backdoor attack targeting skill-based agent systems by embedding malicious logic within reusable skills rather than model parameters. The attack leverages skill composition to execute attacker-defined payloads with up to 97.2% success rates while maintaining clean task performance, revealing critical security gaps in AI agent architectures.

🧠 GPT-5

AI · Bullish · arXiv – CS AI · Mar 17 · 7/10

Orla: A Library for Serving LLM-Based Multi-Agent Systems

Researchers introduce Orla, a new library that simplifies the development and deployment of LLM-based multi-agent systems by providing a serving layer that separates workflow execution from policy decisions. The library offers stage mapping, workflow orchestration, and memory management capabilities that improve performance and reduce costs compared to single-model baselines.

AI · Neutral · Google Research Blog · Jan 28 · 7/10 · 6

Towards a science of scaling agent systems: When and why agent systems work

The article discusses the scientific principles behind scaling agent systems in generative AI, examining the conditions and factors that determine when agent systems perform effectively. It appears to focus on understanding the theoretical foundations for building and deploying AI agent systems at scale.

AI · Neutral · arXiv – CS AI · Jun 23 · 6/10

Confidence Laundering in Agent Systems: Why Uncertainty Needs a Latent Carrier

Researchers identify 'confidence laundering' as a critical failure mode in multi-component agent systems where upstream uncertainty gets masked by downstream components, leading to error amplification. They propose 'latent uncertainty' as a solution to preserve decision fragility across component interfaces rather than treating intermediate outputs as procedurally valid artifacts.

AI · Neutral · arXiv – CS AI · Jun 23 · 6/10

RIZZ: Routing Interactions to Near Zero-Interference Zones for Continual Adaptation of Black-Box Agents

Researchers introduce RIZZ, a black-box adaptation framework for large language models deployed as long-lived agents that must continually adapt across diverse tasks and domains without access to model weights. The system uses verifier-gated memory, dynamic routing, and prompt compilation to prevent task interference while learning from sparse feedback in nonstationary environments.

AI · Neutral · arXiv – CS AI · Jun 23 · 6/10

Causal Discovery in the Era of Agents

Researchers propose a new framework for integrating AI agents into causal discovery workflows, arguing that language models should assist with data inspection and explanation rather than directly generating causal claims. The causal-learn+ platform implements this principle, maintaining algorithmic rigor while leveraging AI to improve accessibility and interpretation of causal analysis.

AI · Neutral · arXiv – CS AI · Jun 10 · 6/10

SkillResolve-Bench: Measuring and Resolving Same-Capability Ambiguity in Agent Skill Retrieval

Researchers introduce SkillResolve-Bench, a benchmark for evaluating agent skill retrieval systems that addresses the critical problem of selecting the correct skill variant when multiple capabilities are semantically similar. The benchmark includes 661 helper/risky skill pairs and proposes SkillResolve, a method that achieves safer procedural exposure by selecting appropriate skill representatives from capability families.

AI · Neutral · arXiv – CS AI · Jun 9 · 6/10

Observability for Delegated Execution in Agentic AI Systems

Researchers propose a new observability framework for tracking delegated execution in AI agent systems, addressing a critical gap where audit logs fail to distinguish which delegation scope authorized specific actions. The solution uses a lightweight gateway and information model to enable forensic reconstruction of agent activities across heterogeneous tools without relying on unreliable time-window correlation.

AI · Neutral · arXiv – CS AI · Jun 9 · 6/10

PathoSage: Towards Multi-Source Evidence Adjudication in Pathology via Experience-Aware Agentic Workflow

PathoSage is a new AI framework that improves pathology analysis by separating evidence collection from decision-making, reducing hallucinations in multimodal large language models. The system uses structured evidence deliberation and a reliability-tracking mechanism to better judge conflicting medical information, outperforming existing pathology AI models.
