y0news

#llm-agents News & Analysis

74 articles tagged with #llm-agents. AI-curated summaries with sentiment analysis and key takeaways from 50+ sources.

AI · Bullish · arXiv – CS AI · Mar 16 · 7/10

Spend Less, Reason Better: Budget-Aware Value Tree Search for LLM Agents

Researchers propose Budget-Aware Value Tree (BAVT), a training-free framework that improves LLM agent efficiency by intelligently managing computational resources during multi-hop reasoning tasks. The system outperforms traditional approaches while using 4x fewer resources, demonstrating that smart budget management beats brute-force compute scaling.
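
The core idea — spending a fixed expansion budget on the most promising frontier nodes instead of exploring uniformly — can be sketched as a small best-first search. This is an illustrative reconstruction, not the paper's algorithm; `expand` and `value` stand in for an LLM step proposer and a value model.

```python
import heapq
from dataclasses import dataclass, field

@dataclass(order=True)
class Node:
    neg_value: float                       # negated so heapq pops the best node first
    state: int = field(compare=False)
    depth: int = field(compare=False, default=0)

def budget_aware_search(root_state, expand, value, budget, max_depth=4):
    """Best-first tree search that spends a fixed expansion budget on the
    highest-value frontier nodes instead of exploring uniformly."""
    best = (value(root_state), root_state)
    frontier = [Node(-best[0], root_state)]
    spent = 0
    while frontier and spent < budget:
        node = heapq.heappop(frontier)
        spent += 1                         # each expansion consumes one unit of budget
        for child in expand(node.state):
            v = value(child)
            if v > best[0]:
                best = (v, child)
            if node.depth + 1 < max_depth:
                heapq.heappush(frontier, Node(-v, child, node.depth + 1))
    return best

# Toy demo: states are integers, "reasoning" appends a digit, value prefers larger states.
expand = lambda s: [s * 10 + d for d in (1, 2, 3)]
best_v, best_s = budget_aware_search(1, expand, float, budget=5)
print(best_s)  # → 13333
```

With only 5 expansions the search still reaches the deepest high-value state, because the budget is always spent on the best frontier node rather than spread breadth-first.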

AI · Bullish · arXiv – CS AI · Mar 16 · 7/10

Towards AI Search Paradigm

Researchers introduce the AI Search Paradigm, a comprehensive framework for next-generation search systems using four LLM-powered agents (Master, Planner, Executor, Writer) that collaborate to handle everything from simple queries to complex reasoning tasks. The system employs modular architecture with dynamic workflows for task planning, tool integration, and content synthesis to create more adaptive and scalable AI search capabilities.
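
A minimal sketch of the Master/Planner/Executor/Writer hand-off described above, with stub functions in place of the four LLM-powered agents; the routing and decomposition logic here is purely illustrative, not the paper's.

```python
# Hypothetical sketch of a Master/Planner/Executor/Writer pipeline.
# Role names follow the paper; the stub logic below is illustrative only.
def master(query: str) -> str:
    # Route: simple queries skip planning, compound ones get decomposed.
    return "complex" if " and " in query else "simple"

def planner(query: str) -> list[str]:
    # Toy decomposition: split a compound query into sub-tasks.
    return [part.strip() for part in query.split(" and ")]

def executor(subtask: str) -> str:
    # Stand-in for tool calls / retrieval; returns a pseudo-result.
    return f"result({subtask})"

def writer(results: list[str]) -> str:
    # Synthesize sub-results into one answer.
    return "; ".join(results)

def ai_search(query: str) -> str:
    if master(query) == "simple":
        return writer([executor(query)])
    return writer([executor(t) for t in planner(query)])

print(ai_search("population of France and GDP of Japan"))
# → result(population of France); result(GDP of Japan)
```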

AI · Bullish · arXiv – CS AI · Mar 11 · 7/10

Hindsight Credit Assignment for Long-Horizon LLM Agents

Researchers introduced HCAPO, a new framework that uses hindsight credit assignment to improve Large Language Model agents' performance in long-horizon tasks. The system leverages LLMs as post-hoc critics to refine decision-making, achieving 7.7% and 13.8% improvements over existing methods on WebShop and ALFWorld benchmarks respectively.
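
The underlying mechanic — redistributing a single terminal reward across a long trajectory according to a post-hoc critic's per-step scores — can be sketched in a few lines. The `critic` below is a toy stand-in for the paper's LLM critic, and the formula is illustrative rather than HCAPO's actual objective.

```python
def assign_credit(steps, final_reward, critic):
    """Redistribute one terminal reward across a trajectory in hindsight,
    proportionally to a post-hoc critic's per-step contribution scores."""
    scores = [critic(s) for s in steps]
    total = sum(scores) or 1.0             # guard against an all-zero critic
    return [final_reward * s / total for s in scores]

# Stub critic: longer action strings judged more instrumental (illustrative only).
steps = ["search[shoes]", "click[item-3]", "noop", "buy"]
credits = assign_credit(steps, final_reward=1.0, critic=len)
print([round(c, 2) for c in credits])  # → [0.39, 0.39, 0.12, 0.09]
```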

AI · Neutral · arXiv – CS AI · Mar 9 · 7/10

Uncertainty Quantification in LLM Agents: Foundations, Emerging Challenges, and Opportunities

Researchers present a new framework for uncertainty quantification in AI agents, highlighting critical gaps in current research that focuses on single-turn interactions rather than complex multi-step agent deployments. The paper identifies four key technical challenges and proposes foundations for safer AI agent systems in real-world applications.

AI · Bullish · arXiv – CS AI · Mar 5 · 6/10

AriadneMem: Threading the Maze of Lifelong Memory for LLM Agents

Researchers have developed AriadneMem, a new memory system for long-horizon LLM agents that addresses challenges in maintaining accurate memory under fixed context budgets. The system uses a two-phase pipeline with entropy-aware gating and conflict-aware coarsening to improve multi-hop reasoning while reducing runtime by 77.8% and using only 497 context tokens.

🧠 GPT-4
AI · Bullish · arXiv – CS AI · Mar 5 · 7/10

AutoHarness: improving LLM agents by automatically synthesizing a code harness

Researchers developed AutoHarness, a technique where smaller LLMs like Gemini-2.5-Flash can automatically generate code harnesses to prevent illegal moves in games, outperforming larger models like Gemini-2.5-Pro and GPT-5.2-High. The method eliminates 78% of failures attributed to illegal moves in chess competitions and demonstrates superior performance across 145 different games.
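
A harness of this kind is essentially a validation wrapper between the agent and the environment. A minimal hand-written sketch for a board game follows (the real system synthesizes such wrappers automatically; the retry count and fallback policy here are assumptions):

```python
def legal_moves(board):
    """Empty squares on a 9-cell tic-tac-toe board."""
    return [i for i, cell in enumerate(board) if cell == " "]

def harness(board, propose_move, retries=3):
    """Reject illegal moves before they reach the game: re-query the agent
    a few times, then fall back to the first legal move so it never forfeits."""
    for _ in range(retries):
        move = propose_move(board)
        if move in legal_moves(board):
            return move
    return legal_moves(board)[0]           # safe fallback

# Stub "agent" that blindly suggests square 0 regardless of board state.
board = ["X", " ", "O", " ", " ", " ", " ", " ", " "]
print(harness(board, lambda b: 0))  # square 0 is occupied → fallback → 1
```

The point of the paper is that a small model can write this kind of wrapper for each game, letting the wrapped agent avoid the illegal-move failures that dominate unharnessed play.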

🧠 Gemini
AI · Bullish · arXiv – CS AI · Mar 5 · 6/10

MAGE: Meta-Reinforcement Learning for Language Agents toward Strategic Exploration and Exploitation

Researchers propose MAGE, a meta-reinforcement learning framework that enables Large Language Model agents to strategically explore and exploit in multi-agent environments. The framework uses multi-episode training with interaction histories and reflections, showing superior performance compared to existing baselines and strong generalization to unseen opponents.

AI · Bullish · arXiv – CS AI · Mar 5 · 6/10

PlugMem: A Task-Agnostic Plugin Memory Module for LLM Agents

Researchers propose PlugMem, a task-agnostic plugin memory module for LLM agents that structures episodic memories into knowledge-centric graphs for efficient retrieval. The system consistently outperforms existing memory designs across multiple benchmarks while maintaining transferability between different tasks.
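
Structuring episodic memory as a graph of entity-relation triples might look like the following toy sketch; retrieval here is a plain multi-hop expansion from a query entity, not PlugMem's actual mechanism, and all names are hypothetical.

```python
from collections import defaultdict

class GraphMemory:
    """Toy knowledge-centric memory: episodes are stored as
    (subject, relation, object) triples and retrieved by entity."""

    def __init__(self):
        self.edges = defaultdict(list)     # subject -> [(relation, object), ...]

    def add_episode(self, triples):
        for subj, rel, obj in triples:
            self.edges[subj].append((rel, obj))

    def retrieve(self, entity, hops=2):
        """Collect facts reachable from `entity` within `hops` graph hops."""
        facts, frontier = [], {entity}
        for _ in range(hops):
            nxt = set()
            for node in frontier:
                for rel, obj in self.edges[node]:
                    facts.append((node, rel, obj))
                    nxt.add(obj)
            frontier = nxt
        return facts

mem = GraphMemory()
mem.add_episode([("order#7", "contains", "red mug"),
                 ("red mug", "shipped_by", "carrier-A")])
print(mem.retrieve("order#7"))
```

Because retrieval walks the graph from entities rather than matching task-specific text, the same store can be plugged into a different task without retraining — the transferability claim in the summary.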

AI · Neutral · arXiv – CS AI · Mar 5 · 6/10

WebDS: An End-to-End Benchmark for Web-based Data Science

Researchers introduce WebDS, a new benchmark for evaluating AI agents on real-world web-based data science tasks across 870 scenarios and 29 websites. Current state-of-the-art LLM agents achieve only 15% success rates compared to 90% human accuracy, revealing significant gaps in AI capabilities for complex data workflows.

AI · Bearish · arXiv – CS AI · Mar 4 · 7/10

ZeroDayBench: Evaluating LLM Agents on Unseen Zero-Day Vulnerabilities for Cyberdefense

Researchers introduced ZeroDayBench, a new benchmark testing LLM agents' ability to find and patch 22 critical vulnerabilities in open-source code. Testing on frontier models GPT-5.2, Claude Sonnet 4.5, and Grok 4.1 revealed that current LLMs cannot yet autonomously solve cybersecurity tasks, highlighting limitations in AI-powered code security.

AI · Bullish · arXiv – CS AI · Mar 4 · 7/10

Contextualized Privacy Defense for LLM Agents

Researchers propose Contextualized Defense Instructing (CDI), a new privacy defense paradigm for LLM agents that uses reinforcement learning to generate context-aware privacy guidance during execution. The approach achieves 94.2% privacy preservation while maintaining 80.6% helpfulness, outperforming static defense methods.

AI · Bullish · arXiv – CS AI · Mar 4 · 7/10

Guideline-Grounded Evidence Accumulation for High-Stakes Agent Verification

Researchers developed GLEAN, a new AI verification framework that improves reliability of LLM-powered agents in high-stakes decisions like clinical diagnosis. The system uses expert guidelines and Bayesian logistic regression to better verify AI agent decisions, showing 12% improvement in accuracy and 50% better calibration in medical diagnosis tests.

AI · Neutral · arXiv – CS AI · Mar 3 · 7/10

Agentic Unlearning: When LLM Agent Meets Machine Unlearning

Researchers introduce 'agentic unlearning' through Synchronized Backflow Unlearning (SBU), a framework that removes sensitive information from both AI model parameters and persistent memory systems. The method addresses critical gaps in existing unlearning techniques by preventing cross-pathway recontamination between memory and parameters.

AI · Bullish · arXiv – CS AI · Feb 27 · 7/10

AgentSentry: Mitigating Indirect Prompt Injection in LLM Agents via Temporal Causal Diagnostics and Context Purification

Researchers have developed AgentSentry, a novel defense framework that protects AI agents from indirect prompt injection attacks by detecting and mitigating malicious control attempts in real-time. The system achieved 74.55% utility under attack, significantly outperforming existing defenses by 20-33 percentage points while maintaining benign performance.

AI · Neutral · arXiv – CS AI · 1d ago · 6/10

No More Stale Feedback: Co-Evolving Critics for Open-World Agent Learning

Researchers introduce ECHO, a reinforcement learning framework that co-evolves policy and critic models to address the problem of stale feedback in LLM agent training. The system uses cascaded rollouts and saturation-aware gain shaping to maintain synchronized, relevant critique as the agent's behavior improves over time, demonstrating enhanced stability and success rates in complex environments.

AI · Neutral · arXiv – CS AI · 1d ago · 6/10

How memory can affect collective and cooperative behaviors in an LLM-Based Social Particle Swarm

Researchers demonstrated that memory length in LLM-based multi-agent systems produces contradictory effects on cooperation depending on the model used: Gemini showed suppressed cooperation with longer memory, while Gemma exhibited enhanced cooperation. The findings suggest model-specific characteristics and alignment mechanisms fundamentally shape emergent social behaviors in AI agent systems.

🧠 Gemini
AI · Neutral · arXiv – CS AI · 1d ago · 6/10

Spatial Atlas: Compute-Grounded Reasoning for Spatial-Aware Research Agent Benchmarks

Researchers introduce Spatial Atlas, a compute-grounded reasoning system that combines deterministic spatial computation with large language models to create spatial-aware research agents. The framework demonstrates competitive performance on two benchmarks—FieldWorkArena for multimodal spatial question-answering and MLE-Bench for machine learning competitions—while improving interpretability by grounding reasoning in structured spatial scene graphs rather than relying on hallucinated outputs.

🏢 OpenAI · 🏢 Anthropic
AI · Neutral · arXiv – CS AI · 1d ago · 6/10

The A-R Behavioral Space: Execution-Level Profiling of Tool-Using Language Model Agents in Organizational Deployment

Researchers introduce a new behavioral measurement framework for tool-augmented language models deployed in organizations, using a two-dimensional Action Rate and Refusal Signal space to profile how LLM agents execute tasks under different autonomy configurations and risk contexts. The approach prioritizes execution-layer characterization over aggregate safety scoring, revealing that reflection-based scaffolding systematically shifts agent behavior in high-risk scenarios.

AI · Neutral · arXiv – CS AI · 2d ago · 6/10

From Helpful to Trustworthy: LLM Agents for Pair Programming

Doctoral research proposes a systematic framework for multi-agent LLM pair programming that improves code reliability and auditability through externalized intent and iterative validation. The study addresses critical gaps in how AI coding agents can produce trustworthy outputs aligned with developer objectives across testing, implementation, and maintenance workflows.

AI · Bullish · arXiv – CS AI · 2d ago · 6/10

Skill-SD: Skill-Conditioned Self-Distillation for Multi-turn LLM Agents

Researchers introduce Skill-SD, a novel training framework for multi-turn LLM agents that improves sample efficiency by converting successful agent trajectories into dynamic natural language skills that condition a teacher model. The approach combines reinforcement learning with self-distillation and achieves significant performance improvements over baseline methods on benchmark tasks.

AI · Neutral · arXiv – CS AI · 2d ago · 6/10

ClawVM: Harness-Managed Virtual Memory for Stateful Tool-Using LLM Agents

ClawVM is a virtual memory management system designed for stateful LLM agents that addresses critical failures in current context window management. The system implements typed pages, multi-resolution representations, and validated writeback protocols to ensure deterministic state residency and durability, adding minimal computational overhead.
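
The "typed pages with validated writeback" idea can be illustrated with a toy page store that refuses to persist any state failing its type or validator check; this sketch is not ClawVM's interface, and all names are hypothetical.

```python
class PageStore:
    """Toy harness-managed state store: each page declares a type and a
    validator, and writebacks are rejected unless both checks pass."""

    def __init__(self):
        self.pages = {}                    # durable, validated state only
        self.validators = {}               # key -> (type, validator fn)

    def define(self, key, page_type, validator):
        self.validators[key] = (page_type, validator)
        self.pages[key] = None

    def writeback(self, key, value):
        page_type, validator = self.validators[key]
        if not isinstance(value, page_type) or not validator(value):
            raise ValueError(f"rejected writeback to {key!r}")
        self.pages[key] = value            # only validated state becomes durable

store = PageStore()
store.define("cart_total", float, lambda v: v >= 0.0)
store.writeback("cart_total", 19.99)
try:
    store.writeback("cart_total", -5.0)    # fails validation, state unchanged
except ValueError:
    pass
print(store.pages["cart_total"])  # → 19.99
```

The property this buys is determinism about what resides in durable state: an agent's malformed write can never silently corrupt a page, which is the failure mode the summary says plain context-window management allows.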

AI · Neutral · arXiv – CS AI · 2d ago · 6/10

Agent^2 RL-Bench: Can LLM Agents Engineer Agentic RL Post-Training?

Researchers introduce Agent^2 RL-Bench, a benchmark testing whether LLM agents can autonomously design and execute reinforcement learning pipelines to improve foundation models. Testing across multiple agent systems reveals significant performance variation, with online RL succeeding primarily on ALFWorld while supervised learning pipelines dominate under fixed computational budgets.

AI · Neutral · arXiv – CS AI · 2d ago · 6/10

From Agent Loops to Structured Graphs: A Scheduler-Theoretic Framework for LLM Agent Execution

Researchers propose SGH (Structured Graph Harness), a framework that replaces iterative Agent Loops with explicit directed acyclic graphs (DAGs) for LLM agent execution. The approach addresses structural weaknesses in current agent design by enforcing immutable execution plans, separating planning from recovery, and implementing strict escalation protocols, trading some flexibility for improved controllability and verifiability.
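
The shift from an open-ended loop to an immutable plan can be illustrated with Python's standard `graphlib`: dependencies are declared up front and steps run in topological order, so execution structure is fixed before any step fires. The node names and stub handlers below are hypothetical, not SGH's API.

```python
from graphlib import TopologicalSorter

# An explicit plan: each step maps to the set of steps it depends on.
dag = {
    "plan":   set(),
    "search": {"plan"},
    "fetch":  {"plan"},
    "write":  {"search", "fetch"},
}

# Stub handlers standing in for LLM/tool calls at each node.
handlers = {name: (lambda n=name: f"{n}-done") for name in dag}

def run(dag, handlers):
    """Execute the DAG in topological order; the plan itself is immutable."""
    order = list(TopologicalSorter(dag).static_order())
    return [handlers[step]() for step in order]

print(run(dag, handlers))
```

Compared with an agent loop, the DAG can be checked before execution (no cycles, every dependency defined), which is the controllability and verifiability trade the summary describes.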

← Prev · Page 2 of 3 · Next →