74 articles tagged with #llm-agents. AI-curated summaries with sentiment analysis and key takeaways from 50+ sources.
AI · Bullish · arXiv – CS AI · Mar 16 · 7/10
🧠Researchers propose Budget-Aware Value Tree (BAVT), a training-free framework that improves LLM agent efficiency by intelligently managing computational resources during multi-hop reasoning tasks. The system outperforms traditional approaches while using 4x fewer resources, demonstrating that smart budget management beats brute-force compute scaling.
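The budget-management idea can be sketched as a best-first search over a value tree that stops expanding once a fixed node budget is spent. This is a minimal illustration only; `expand`, `value`, and the notion of budget-as-expansion-count are assumptions, not BAVT's actual interface:

```python
import heapq

def budget_aware_search(root, expand, value, budget):
    """Best-first expansion of a reasoning tree under a hard node budget.

    expand(node) yields child nodes; value(node) scores a node.
    Each child generated consumes one unit of budget, so compute is
    concentrated on the most promising branches instead of exhaustive search.
    """
    best = root
    frontier = [(-value(root), 0, root)]  # max-heap via negated values
    tiebreak = 1                          # avoids comparing nodes directly
    spent = 0
    while frontier and spent < budget:
        _, _, node = heapq.heappop(frontier)
        for child in expand(node):
            spent += 1
            if value(child) > value(best):
                best = child
            heapq.heappush(frontier, (-value(child), tiebreak, child))
            tiebreak += 1
            if spent >= budget:
                break
    return best
```

On a toy tree where `expand(n)` returns `[2n, 2n+1]` and `value` is the node itself, the search follows the high-value branch and returns the best node reachable within the budget rather than enumerating the whole tree.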
AI · Bullish · arXiv – CS AI · Mar 16 · 7/10
🧠Researchers introduce the AI Search Paradigm, a comprehensive framework for next-generation search systems using four LLM-powered agents (Master, Planner, Executor, Writer) that collaborate to handle everything from simple queries to complex reasoning tasks. The system employs modular architecture with dynamic workflows for task planning, tool integration, and content synthesis to create more adaptive and scalable AI search capabilities.
AI · Neutral · arXiv – CS AI · Mar 12 · 7/10
🧠Researchers propose treating multi-agent AI memory as a computer architecture problem, introducing a three-layer memory hierarchy and identifying critical protocol gaps. The paper highlights multi-agent memory consistency as the most pressing challenge for building scalable collaborative AI systems.
AI · Bullish · arXiv – CS AI · Mar 11 · 7/10
🧠Researchers introduced HCAPO, a new framework that uses hindsight credit assignment to improve Large Language Model agents' performance in long-horizon tasks. The system leverages LLMs as post-hoc critics to refine decision-making, achieving 7.7% and 13.8% improvements over existing methods on WebShop and ALFWorld benchmarks respectively.
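The core of hindsight credit assignment is turning one sparse end-of-episode reward into per-step learning signals. A minimal sketch, assuming the critic is any callable that scores a step's contribution after the fact (in HCAPO that role is played by an LLM critic; the function names here are invented):

```python
def hindsight_credit(steps, final_reward, critic):
    """Distribute a sparse end-of-episode reward over individual steps.

    critic(step, steps) returns a nonnegative relevance score, judged in
    hindsight with the whole trajectory visible. Each step then receives
    its proportional share of the final reward.
    """
    scores = [max(critic(step, steps), 0.0) for step in steps]
    total = sum(scores) or 1.0  # guard against an all-zero critic
    return [final_reward * s / total for s in scores]
```

With step-level credits in hand, a policy-gradient update can weight each action by its credit instead of the undifferentiated episode return, which is what makes long-horizon tasks tractable.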
AI · Neutral · arXiv – CS AI · Mar 9 · 7/10
🧠Researchers present a new framework for uncertainty quantification in AI agents, highlighting critical gaps in current research that focuses on single-turn interactions rather than complex multi-step agent deployments. The paper identifies four key technical challenges and proposes foundations for safer AI agent systems in real-world applications.
AI · Bullish · arXiv – CS AI · Mar 5 · 6/10
🧠Researchers have developed AriadneMem, a new memory system for long-horizon LLM agents that addresses challenges in maintaining accurate memory under fixed context budgets. The system uses a two-phase pipeline with entropy-aware gating and conflict-aware coarsening to improve multi-hop reasoning while reducing runtime by 77.8% and using only 497 context tokens.
🧠 GPT-4
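One plausible reading of entropy-aware gating is that candidate memory writes are admitted only when the model's distribution over them is sharp, i.e. low-entropy. The sketch below illustrates only that gating idea; the candidate format, threshold, and use of token probabilities as a confidence proxy are assumptions, not AriadneMem's actual pipeline:

```python
import math

def shannon_entropy(probs):
    """Shannon entropy in bits of a discrete probability distribution."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def entropy_gate(candidates, threshold):
    """Keep only memory candidates the model is confident about.

    Each candidate is (fact, token_probs); low entropy over token_probs
    is used as a proxy for confidence, so uncertain facts never enter
    the fixed context budget.
    """
    return [fact for fact, probs in candidates
            if shannon_entropy(probs) <= threshold]
```

A confident candidate with probabilities `[0.9, 0.1]` has entropy of about 0.47 bits and passes a 0.5-bit gate, while a maximally uncertain `[0.5, 0.5]` candidate (1 bit) is dropped.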
AI · Bullish · arXiv – CS AI · Mar 5 · 7/10
🧠Researchers developed AutoHarness, a technique where smaller LLMs like Gemini-2.5-Flash can automatically generate code harnesses to prevent illegal moves in games, outperforming larger models like Gemini-2.5-Pro and GPT-5.2-High. The method eliminates 78% of failures attributed to illegal moves in chess competitions and demonstrates superior performance across 145 different games.
🧠 Gemini
AI · Bullish · arXiv – CS AI · Mar 5 · 6/10
🧠Researchers propose MAGE, a meta-reinforcement learning framework that enables Large Language Model agents to strategically explore and exploit in multi-agent environments. The framework uses multi-episode training with interaction histories and reflections, showing superior performance compared to existing baselines and strong generalization to unseen opponents.
AI · Bullish · arXiv – CS AI · Mar 5 · 6/10
🧠Researchers propose PlugMem, a task-agnostic plugin memory module for LLM agents that structures episodic memories into knowledge-centric graphs for efficient retrieval. The system consistently outperforms existing memory designs across multiple benchmarks while maintaining transferability between different tasks.
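A knowledge-centric memory can be approximated with an inverted index from entities to the episodes that mention them, with retrieval ranked by entity overlap. This is a minimal stand-in for the graph-structured design described above; the class and method names are invented:

```python
from collections import defaultdict

class PluginMemory:
    """Episodic memories indexed by the entities they mention.

    store() records an episode and links it to its entities;
    retrieve() ranks stored episodes by how many query entities they share.
    """
    def __init__(self):
        self.episodes = []                 # (text, entity set)
        self.index = defaultdict(set)      # entity -> episode ids

    def store(self, text, entities):
        eid = len(self.episodes)
        self.episodes.append((text, set(entities)))
        for ent in entities:
            self.index[ent].add(eid)

    def retrieve(self, query_entities, k=3):
        hits = defaultdict(int)
        for ent in query_entities:
            for eid in self.index.get(ent, ()):
                hits[eid] += 1
        ranked = sorted(hits, key=lambda e: (-hits[e], e))
        return [self.episodes[e][0] for e in ranked[:k]]
```

Because the index is keyed on entities rather than task-specific text, the same memory instance can be plugged into a different task and still resolve queries, which is the transferability property the summary highlights.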
AI · Neutral · arXiv – CS AI · Mar 5 · 6/10
🧠Researchers introduce WebDS, a new benchmark for evaluating AI agents on real-world web-based data science tasks across 870 scenarios and 29 websites. Current state-of-the-art LLM agents achieve only 15% success rates compared to 90% human accuracy, revealing significant gaps in AI capabilities for complex data workflows.
AI · Bearish · arXiv – CS AI · Mar 4 · 7/10
🧠Researchers introduced ZeroDayBench, a new benchmark testing LLM agents' ability to find and patch 22 critical vulnerabilities in open-source code. Testing on frontier models GPT-5.2, Claude Sonnet 4.5, and Grok 4.1 revealed that current LLMs cannot yet autonomously solve cybersecurity tasks, highlighting limitations in AI-powered code security.
AI · Bullish · arXiv – CS AI · Mar 4 · 7/10
🧠Researchers propose Contextualized Defense Instructing (CDI), a new privacy defense paradigm for LLM agents that uses reinforcement learning to generate context-aware privacy guidance during execution. The approach achieves 94.2% privacy preservation while maintaining 80.6% helpfulness, outperforming static defense methods.
AI · Bullish · arXiv – CS AI · Mar 4 · 7/10
🧠Researchers developed GLEAN, a new AI verification framework that improves reliability of LLM-powered agents in high-stakes decisions like clinical diagnosis. The system uses expert guidelines and Bayesian logistic regression to better verify AI agent decisions, showing 12% improvement in accuracy and 50% better calibration in medical diagnosis tests.
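The verification step can be sketched as logistic regression over guideline-derived features, with the Bayesian treatment crudely approximated by a Gaussian prior (L2 penalty) on the weights. Everything below is illustrative: the feature scheme, hyperparameters, and function names are assumptions, not GLEAN's actual model:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_verifier(X, y, l2=1.0, lr=0.1, steps=500):
    """MAP logistic regression over guideline features.

    Each row of X holds binary features such as 'decision cites guideline G'
    (hypothetical features); y marks whether the agent's decision was
    correct. The Gaussian prior enters as L2 regularisation, and the MAP
    weights are found by plain gradient descent.
    """
    w = [0.0] * len(X[0])
    n = len(X)
    for _ in range(steps):
        grad = [l2 * wj / n for wj in w]           # prior term
        for xi, yi in zip(X, y):
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, xi))) - yi
            for j, xj in enumerate(xi):
                grad[j] += err * xj / n            # likelihood term
        w = [wj - lr * g for wj, g in zip(w, grad)]
    return w

def verify(w, x):
    """Probability that a decision with features x is correct."""
    return sigmoid(sum(wj * xj for wj, xj in zip(w, x)))
```

The regularised weights shrink toward zero unless the data supports them, which is what keeps the verifier's probabilities calibrated rather than overconfident.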
AI · Bearish · arXiv – CS AI · Mar 4 · 7/10
🧠Researchers introduce Procedure-Aware Evaluation (PAE), a framework to assess how AI agents complete tasks, not just whether they succeed. The study reveals that 27-78% of reported AI agent successes are actually "corrupt successes" that mask underlying procedural violations and reliability issues.
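The "corrupt success" distinction can be captured by checking the full trajectory against procedural constraints rather than only the end state. A sketch of that idea, with an invented constraint-predicate interface rather than PAE's actual rubric:

```python
def classify_outcome(trajectory, goal_reached, constraints):
    """Procedure-aware outcome label for a single agent run.

    trajectory is the sequence of actions taken; constraints is a list of
    predicates over the whole trajectory. A run that reaches the goal
    while violating any constraint is a 'corrupt success', not a success.
    """
    violated = any(not check(trajectory) for check in constraints)
    if goal_reached and not violated:
        return "clean success"
    if goal_reached:
        return "corrupt success"
    return "failure"
```

Aggregating these labels over a benchmark separates the headline success rate from the procedurally clean one, which is the gap (27-78% in the summary above) that outcome-only scoring hides.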
AI · Neutral · arXiv – CS AI · Mar 3 · 7/10
🧠Researchers introduce 'agentic unlearning' through Synchronized Backflow Unlearning (SBU), a framework that removes sensitive information from both AI model parameters and persistent memory systems. The method addresses critical gaps in existing unlearning techniques by preventing cross-pathway recontamination between memory and parameters.
AI · Bullish · arXiv – CS AI · Feb 27 · 7/10
🧠Researchers have developed AgentSentry, a novel defense framework that protects AI agents from indirect prompt injection attacks by detecting and mitigating malicious control attempts in real-time. The system achieved 74.55% utility under attack, significantly outperforming existing defenses by 20-33 percentage points while maintaining benign performance.
AI · Neutral · arXiv – CS AI · 1d ago · 6/10
🧠Researchers introduce ECHO, a reinforcement learning framework that co-evolves policy and critic models to address the problem of stale feedback in LLM agent training. The system uses cascaded rollouts and saturation-aware gain shaping to maintain synchronized, relevant critique as the agent's behavior improves over time, demonstrating enhanced stability and success rates in complex environments.
AI · Neutral · arXiv – CS AI · 1d ago · 6/10
🧠Researchers demonstrated that memory length in LLM-based multi-agent systems produces contradictory effects on cooperation depending on the model used: Gemini showed suppressed cooperation with longer memory, while Gemma exhibited enhanced cooperation. The findings suggest model-specific characteristics and alignment mechanisms fundamentally shape emergent social behaviors in AI agent systems.
🧠 Gemini
AI · Neutral · arXiv – CS AI · 1d ago · 6/10
🧠Researchers introduce Spatial Atlas, a compute-grounded reasoning system that combines deterministic spatial computation with large language models to create spatial-aware research agents. The framework demonstrates competitive performance on two benchmarks—FieldWorkArena for multimodal spatial question-answering and MLE-Bench for machine learning competitions—while improving interpretability by grounding reasoning in structured spatial scene graphs rather than relying on hallucinated outputs.
🏢 OpenAI · 🏢 Anthropic
AI · Neutral · arXiv – CS AI · 1d ago · 6/10
🧠Researchers introduce a new behavioral measurement framework for tool-augmented language models deployed in organizations, using a two-dimensional Action Rate and Refusal Signal space to profile how LLM agents execute tasks under different autonomy configurations and risk contexts. The approach prioritizes execution-layer characterization over aggregate safety scoring, revealing that reflection-based scaffolding systematically shifts agent behavior in high-risk scenarios.
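The two-dimensional profiling idea reduces to computing an action rate and a refusal rate from execution-layer logs and placing each run in that plane. A minimal sketch; the event schema (`"act"`, `"refuse"`, `"other"`) is an assumption, not the paper's instrumentation:

```python
def behavior_profile(events):
    """Place an agent run in (action_rate, refusal_rate) space.

    events is a sequence of labels: 'act' for an executed tool call,
    'refuse' for an explicit refusal, anything else for other turns.
    The pair characterises execution behavior without collapsing it
    into a single safety score.
    """
    n = len(events) or 1
    action_rate = sum(e == "act" for e in events) / n
    refusal_rate = sum(e == "refuse" for e in events) / n
    return action_rate, refusal_rate
```

Comparing profiles of the same agent under different autonomy configurations (say, with and without reflection scaffolding) then shows behavioral shifts as movement in this plane rather than as a change in one aggregate number.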
AI · Neutral · arXiv – CS AI · 2d ago · 6/10
🧠Doctoral research proposes a systematic framework for multi-agent LLM pair programming that improves code reliability and auditability through externalized intent and iterative validation. The study addresses critical gaps in how AI coding agents can produce trustworthy outputs aligned with developer objectives across testing, implementation, and maintenance workflows.
AI · Bullish · arXiv – CS AI · 2d ago · 6/10
🧠Researchers introduce Skill-SD, a novel training framework for multi-turn LLM agents that improves sample efficiency by converting successful agent trajectories into dynamic natural language skills that condition a teacher model. The approach combines reinforcement learning with self-distillation and achieves significant performance improvements over baseline methods on benchmark tasks.
AI · Neutral · arXiv – CS AI · 2d ago · 6/10
🧠ClawVM is a virtual memory management system designed for stateful LLM agents that addresses critical failures in current context window management. The system implements typed pages, multi-resolution representations, and validated writeback protocols to ensure deterministic state residency and durability, adding minimal computational overhead.
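The typed-pages-with-validated-writeback idea can be sketched as a page store that rejects any write whose payload fails its page type's validator, so only well-formed state ever becomes durable. All class and method names below are invented for illustration, not ClawVM's API:

```python
class TypedPageStore:
    """Context pages with a declared type and a validated writeback step.

    register() binds a page type to a validator predicate; writeback()
    commits a payload only if the validator accepts it, otherwise the
    write is rejected and prior state is preserved.
    """
    def __init__(self):
        self._validators = {}
        self._pages = {}

    def register(self, page_type, validator):
        self._validators[page_type] = validator

    def writeback(self, page_id, page_type, payload):
        if not self._validators[page_type](payload):
            raise ValueError(f"writeback rejected for page {page_id!r}")
        self._pages[page_id] = (page_type, payload)

    def read(self, page_id):
        return self._pages[page_id][1]
```

Because a failed writeback raises instead of silently storing garbage, the agent's persistent state stays deterministic: every page it can read back was validated at commit time.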
AI · Neutral · arXiv – CS AI · 2d ago · 6/10
🧠Researchers introduce Agent^2 RL-Bench, a benchmark testing whether LLM agents can autonomously design and execute reinforcement learning pipelines to improve foundation models. Testing across multiple agent systems reveals significant performance variation, with online RL succeeding primarily on ALFWorld while supervised learning pipelines dominate under fixed computational budgets.
AI · Neutral · arXiv – CS AI · 2d ago · 6/10
🧠Researchers propose SGH (Structured Graph Harness), a framework that replaces iterative Agent Loops with explicit directed acyclic graphs (DAGs) for LLM agent execution. The approach addresses structural weaknesses in current agent design by enforcing immutable execution plans, separating planning from recovery, and implementing strict escalation protocols, trading some flexibility for improved controllability and verifiability.
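Replacing an open-ended agent loop with an immutable DAG plan can be sketched with the standard library's topological sorter: the plan is fixed up front, an ill-formed (cyclic) plan fails before anything runs, and each step receives exactly its declared dependencies. A sketch of the DAG-execution idea only, not the SGH framework itself:

```python
from graphlib import TopologicalSorter

def run_plan(steps, deps):
    """Execute an immutable plan expressed as a DAG.

    steps maps a step name to a callable taking the results of its
    dependencies; deps maps each step to its prerequisite step names.
    TopologicalSorter raises CycleError up front for ill-formed plans,
    so execution order is verifiable before any step runs.
    """
    results = {}
    for name in TopologicalSorter(deps).static_order():
        args = tuple(results[d] for d in deps.get(name, ()))
        results[name] = steps[name](*args)
    return results
```

Compared with a loop that re-plans every turn, the explicit DAG trades flexibility for controllability, exactly the trade the summary describes: the plan cannot mutate mid-run, and every step's inputs are auditable from `deps`.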