y0news

#llm-agents News & Analysis

74 articles tagged with #llm-agents. AI-curated summaries with sentiment analysis and key takeaways from 50+ sources.

AI · Neutral · arXiv – CS AI · 3d ago · 6/10

ReplicatorBench: Benchmarking LLM Agents for Replicability in Social and Behavioral Sciences

Researchers introduce ReplicatorBench, a comprehensive benchmark for evaluating AI agents' ability to replicate scientific research claims in social and behavioral sciences. The study reveals that current LLM agents excel at designing and executing experiments but struggle significantly with data retrieval, highlighting critical gaps in autonomous research validation capabilities.

AI · Neutral · arXiv – CS AI · 3d ago · 6/10

AgentSociety: Large-Scale Simulation of LLM-Driven Generative Agents Advances Understanding of Human Behaviors and Society

Researchers introduce AgentSociety, a large-scale simulator using LLM-driven agents to study human behavior and social dynamics. The system simulates over 10,000 agents and 5 million interactions to model real-world social phenomena including polarization, policy impacts, and urban sustainability, demonstrating alignment with actual experimental results.

AI · Neutral · arXiv – CS AI · 6d ago · 6/10

Reason in Chains, Learn in Trees: Self-Rectification and Grafting for Multi-turn Agent Policy Optimization

Researchers propose T-STAR, a novel reinforcement learning framework that structures multi-step agent trajectories as trees rather than independent chains, enabling better credit assignment for LLM agents. The method uses tree-based reward propagation and surgical policy optimization to improve reasoning performance across embodied, interactive, and planning tasks.
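The core idea, credit flowing back through shared trajectory prefixes, can be sketched in a few lines. This is an illustrative toy, not T-STAR's algorithm: the node structure and the discounted-mean backup rule are assumptions.

```python
# Hypothetical sketch of tree-based reward propagation for credit
# assignment. The backup rule (discounted mean over children) is an
# illustrative assumption, not the paper's exact method.

class TrajectoryNode:
    def __init__(self, action, reward=0.0):
        self.action = action
        self.reward = reward      # terminal reward at leaves, 0 elsewhere
        self.children = []
        self.value = 0.0          # propagated credit

def propagate_rewards(node, discount=0.9):
    """Back up rewards from leaves so a shared prefix step gets credit
    from every branch that passes through it."""
    if not node.children:
        node.value = node.reward
        return node.value
    child_values = [propagate_rewards(c, discount) for c in node.children]
    node.value = node.reward + discount * sum(child_values) / len(child_values)
    return node.value

# Two rollouts share the first action, then diverge.
root = TrajectoryNode("open_drawer")
good = TrajectoryNode("take_key", reward=1.0)
bad = TrajectoryNode("close_drawer", reward=0.0)
root.children = [good, bad]
propagate_rewards(root)
```

The shared first action receives partial credit (0.45 here) because only one of its two continuations succeeded, which is exactly the signal independent chains cannot express.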

AI · Neutral · arXiv – CS AI · 6d ago · 6/10

How Much LLM Does a Self-Revising Agent Actually Need?

Researchers introduce a declarative runtime protocol that externalizes agent state to measure how much of an LLM-based agent's competence actually derives from the language model versus explicit structural components. Testing on Collaborative Battleship, they find that explicit world-model planning drives most performance gains, while sparse LLM-based revision at 4.3% of turns yields minimal and sometimes negative returns.

AI · Neutral · arXiv – CS AI · 6d ago · 6/10

Front-End Ethics for Sensor-Fused Health Conversational Agents: An Ethical Design Space for Biometrics

Researchers propose an ethical framework for sensor-fused health AI agents that combine biometric data with large language models. The paper identifies critical risks at the user-facing layer where sensor data is translated into health guidance, arguing that the perceived objectivity of biometrics can mask AI errors and turn them into harmful medical directives.

AI · Neutral · arXiv – CS AI · 6d ago · 6/10

Strategic Persuasion with Trait-Conditioned Multi-Agent Systems for Iterative Legal Argumentation

Researchers developed the Strategic Courtroom Framework, a multi-agent simulation where LLM-based prosecution and defense teams engage in iterative legal argumentation with trait-conditioned personalities. Testing across 7,000+ simulated trials revealed that diverse teams with complementary traits outperform homogeneous ones, and a reinforcement learning system can dynamically optimize team composition, demonstrating language as a strategic action space in adversarial domains.

🧠 Gemini
AI · Neutral · arXiv – CS AI · 6d ago · 6/10

Commander-GPT: Dividing and Routing for Multimodal Sarcasm Detection

Researchers introduce Commander-GPT, a modular framework that orchestrates multiple specialized AI agents for multimodal sarcasm detection rather than relying on a single LLM. The system achieves 4.4-11.7% F1 score improvements over existing baselines on standard benchmarks, demonstrating that task decomposition and intelligent routing can overcome LLM limitations in understanding sarcasm.
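The divide-and-route pattern can be shown with a toy commander that dispatches to two stand-in specialists and fuses their verdicts. The keyword rules below replace the LLM specialists purely for illustration; none of these function names come from the paper.

```python
# Toy commander-style router: decompose a multimodal sample,
# dispatch to specialists, fuse the verdicts. All logic here is an
# invented stand-in for Commander-GPT's LLM-driven agents.

def image_agent(sample):
    # Stand-in for a vision specialist: reads an annotated image mood.
    return sample.get("image_mood", "neutral")

def text_agent(sample):
    # Stand-in for a text specialist: crude polarity from keywords.
    positive = {"great", "love", "wonderful"}
    words = sample["text"].lower().split()
    return "positive" if any(w in positive for w in words) else "neutral"

def commander(sample):
    """Fuse specialist outputs: positive words over a gloomy image
    is the classic incongruity that signals sarcasm."""
    mood = image_agent(sample)
    polarity = text_agent(sample)
    return "sarcastic" if polarity == "positive" and mood == "gloomy" else "literal"

verdict = commander({"text": "Great weather today", "image_mood": "gloomy"})
```

The point of the decomposition is that neither specialist alone sees the incongruity; only the router that holds both verdicts can.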

🧠 GPT-4 · 🧠 Gemini
AI · Bullish · arXiv – CS AI · Mar 16 · 6/10

ToolTree: Efficient LLM Agent Tool Planning via Dual-Feedback Monte Carlo Tree Search and Bidirectional Pruning

Researchers have developed ToolTree, a new Monte Carlo tree search-based planning system for LLM agents that improves tool selection and usage through dual-feedback evaluation and bidirectional pruning. The system achieves approximately 10% performance gains over existing methods while maintaining high efficiency across multiple benchmarks.
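A minimal sketch of the kind of search ToolTree builds on: UCT-style selection over candidate tool calls, with low-value branches pruned before selection. The statistics and pruning threshold are invented, and the paper's dual-feedback scoring and bidirectional pruning are substantially more involved.

```python
import math

# Toy UCT selection over tool-call candidates with value-based
# pruning. Candidate stats and the threshold are illustrative.

def uct_score(value_sum, visits, parent_visits, c=1.4):
    if visits == 0:
        return float("inf")     # unvisited branches are tried first
    return value_sum / visits + c * math.sqrt(math.log(parent_visits) / visits)

def select_tool(candidates, parent_visits, prune_below=0.2):
    """Drop branches whose mean value is already poor, then pick the
    highest-UCT survivor."""
    survivors = [c for c in candidates
                 if c["visits"] == 0 or c["value"] / c["visits"] >= prune_below]
    return max(survivors,
               key=lambda c: uct_score(c["value"], c["visits"], parent_visits))

candidates = [
    {"name": "search_api",  "value": 3.0, "visits": 5},
    {"name": "calculator",  "value": 0.2, "visits": 4},  # mean 0.05: pruned
    {"name": "wiki_lookup", "value": 1.5, "visits": 2},
]
best = select_tool(candidates, parent_visits=11)
```

Pruning keeps hopeless branches from consuming simulation budget, which is where most of the efficiency in MCTS-based tool planning comes from.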

AI · Neutral · arXiv – CS AI · Mar 16 · 6/10

SkillsBench: Benchmarking How Well Agent Skills Work Across Diverse Tasks

SkillsBench introduces a new benchmark to evaluate Agent Skills - structured packages of procedural knowledge that enhance LLM agents. Testing across 86 tasks and 11 domains shows curated Skills improve performance by 16.2 percentage points on average, while self-generated Skills provide no benefit.

AI · Bullish · arXiv – CS AI · Mar 6 · 6/10

Adaptive Memory Admission Control for LLM Agents

Researchers propose Adaptive Memory Admission Control (A-MAC), a new framework for managing long-term memory in LLM-based agents. The system improves memory precision-recall by 31% while reducing latency through structured decision-making based on five interpretable factors rather than opaque LLM-driven policies.
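The contrast with opaque LLM-driven admission can be made concrete with a weighted scoring gate. The five factor names, weights, and threshold below are assumptions chosen to illustrate the structure, not A-MAC's actual scoring function.

```python
# Illustrative admission-control gate over five interpretable
# factors. Factor names, weights, and threshold are invented.

FACTOR_WEIGHTS = {
    "novelty": 0.3,             # how different from what's already stored
    "task_relevance": 0.3,
    "specificity": 0.15,
    "recency": 0.15,
    "source_reliability": 0.1,
}

def admit(candidate, threshold=0.5):
    """Admit a memory item only if its weighted factor score clears
    the threshold, instead of asking an LLM per item."""
    score = sum(FACTOR_WEIGHTS[f] * candidate[f] for f in FACTOR_WEIGHTS)
    return score >= threshold, round(score, 3)

keep, score = admit({"novelty": 0.9, "task_relevance": 0.8,
                     "specificity": 0.6, "recency": 0.5,
                     "source_reliability": 0.4})
```

Because the decision is a fixed arithmetic over named factors, every admission is cheap and auditable, which is the latency and interpretability argument the paper makes.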

AI · Bullish · arXiv – CS AI · Mar 6 · 6/10

EvoTool: Self-Evolving Tool-Use Policy Optimization in LLM Agents via Blame-Aware Mutation and Diversity-Aware Selection

Researchers propose EvoTool, a new framework that optimizes AI agent tool-use policies through evolutionary algorithms rather than traditional gradient-based methods. The system decomposes agent policies into four modules and uses blame attribution and targeted mutations to improve performance, showing over 5-point improvements on benchmarks.

🧠 GPT-4
AI · Bullish · arXiv – CS AI · Mar 3 · 6/10

Graph-Based Self-Healing Tool Routing for Cost-Efficient LLM Agents

Researchers developed Self-Healing Router, a fault-tolerant system for LLM agents that reduces control-plane LLM calls by 93% while maintaining correctness. The system uses graph-based routing with automatic recovery mechanisms, treating agent decisions as routing problems rather than reasoning tasks.
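The "routing problem, not reasoning task" framing can be sketched as a fallback chain where the LLM is just the last resort. The handler names and route table are invented for illustration; the paper's graph construction and recovery logic are richer.

```python
# Sketch of dispatch as graph routing with automatic fallback:
# deterministic edges handle the common case, and the expensive
# LLM control plane is consulted only when pre-wired routes fail.
# All names here are illustrative assumptions.

ROUTES = {
    # step -> ordered fallback chain of handlers
    "fetch_weather": ["primary_api", "cache", "llm_fallback"],
}

def call(handler, healthy):
    if handler == "llm_fallback":
        return "llm_answer"          # always succeeds, but costs an LLM call
    if healthy.get(handler):
        return f"{handler}_result"
    raise RuntimeError(handler)

def route(step, healthy, stats):
    """Walk the fallback chain; record which handler served the step."""
    for handler in ROUTES[step]:
        try:
            result = call(handler, healthy)
            stats[handler] = stats.get(handler, 0) + 1
            return result
        except RuntimeError:
            continue
    raise RuntimeError("no route for " + step)

stats = {}
out = route("fetch_weather", {"primary_api": False, "cache": True}, stats)
```

In this toy run the primary handler fails but the cache heals the route, so no LLM call is spent, which is the mechanism behind the reported 93% reduction in control-plane calls.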

AI · Neutral · arXiv – CS AI · Mar 3 · 7/10

Evaluating and Understanding Scheming Propensity in LLM Agents

Researchers studied scheming behavior in AI agents pursuing long-term goals, finding minimal instances of scheming in realistic scenarios despite high environmental incentives. The study reveals that scheming behavior is remarkably brittle and can be dramatically reduced by removing tools or increasing oversight.

AI · Neutral · arXiv – CS AI · Mar 3 · 7/10

Verifier-Bound Communication for LLM Agents: Certified Bounds on Covert Signaling

Researchers present CLBC, a new protocol to prevent AI language model agents from hiding coordination in seemingly compliant messages. The system uses verifier-bound communication where messages must pass through a small verifier with proof-bound envelopes to be admitted to transcript state.
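The verifier-bound idea, that nothing reaches the shared transcript without passing a small checker, can be shown with a toy envelope grammar. The field whitelist and containment check below are invented stand-ins for CLBC's proof-bound envelopes.

```python
# Toy verifier-bound communication: a message enters the transcript
# only if a small verifier accepts it against its declared claim.
# The envelope schema here is an illustrative assumption.

ALLOWED_FIELDS = {"claim", "evidence"}

def verifier(envelope):
    """Reject envelopes with undeclared fields (a channel for covert
    signaling) or evidence that does not support the claim."""
    if set(envelope) != ALLOWED_FIELDS:
        return False
    return envelope["claim"] in envelope["evidence"]

def admit_to_transcript(transcript, envelope):
    if verifier(envelope):
        transcript.append(envelope["claim"])
        return True
    return False

transcript = []
ok = admit_to_transcript(
    transcript, {"claim": "task done", "evidence": "log: task done at step 4"})
covert = admit_to_transcript(
    transcript, {"claim": "task done", "evidence": "x", "hidden": "side channel"})
```

The second envelope is rejected not because its claim is false but because it carries structure the verifier did not license, which is how the protocol bounds covert signaling capacity.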

AI · Bullish · arXiv – CS AI · Mar 3 · 6/10

Reasoning as Gradient: Scaling MLE Agents Beyond Tree Search

Researchers introduced GOME, an AI agent that uses gradient-based optimization instead of tree search for machine learning engineering tasks, achieving 35.1% success rate on MLE-Bench. The study shows gradient-based approaches outperform tree search as AI reasoning capabilities improve, suggesting this method will become more effective as LLMs advance.

AI · Bullish · arXiv – CS AI · Mar 3 · 6/10

AISSISTANT: Human-AI Collaborative Review and Perspective Research Workflows in Data Science

Researchers introduce AIssistant, an open-source framework that combines human expertise with AI agents to streamline scientific review and perspective paper creation in data science. The system uses 15 specialized LLM-driven agents across two workflows and demonstrates 65.7% time savings while maintaining research quality through strategic human oversight.

AI · Bullish · arXiv – CS AI · Mar 3 · 6/10

Phase-Aware Mixture of Experts for Agentic Reinforcement Learning

Researchers propose Phase-Aware Mixture of Experts (PA-MoE) to improve reinforcement learning for LLM agents by addressing simplicity bias where simple tasks dominate network parameters. The approach uses a phase router to maintain temporal consistency in expert assignments, allowing better specialization for complex tasks.
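Temporal consistency in expert assignment can be illustrated with a router that only switches experts after several consecutive gate votes for the new expert. The patience rule and vote format are invented; PA-MoE's phase router is learned, not hand-coded.

```python
# Toy phase router: smooth per-step gate votes so expert assignment
# is temporally consistent within a phase. The patience mechanism
# is an illustrative stand-in for the learned phase router.

def route_steps(gate_votes, patience=2):
    """gate_votes: expert preferred by the raw gate at each step.
    Switch only after `patience` consecutive votes for a new expert."""
    assignments, current = [], gate_votes[0]
    pending, streak = None, 0
    for vote in gate_votes:
        if vote == current:
            pending, streak = None, 0        # jitter dismissed
        elif vote == pending:
            streak += 1
            if streak >= patience:           # sustained phase change
                current, pending, streak = vote, None, 0
        else:
            pending, streak = vote, 1        # first vote for a new expert
        assignments.append(current)
    return assignments

out = route_steps(["A", "A", "B", "A", "B", "B", "B"])
```

The isolated vote for B at step 3 is treated as noise, while the sustained run of B votes triggers a switch, so each expert sees contiguous spans of a single phase and can specialize.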

AI · Bullish · arXiv – CS AI · Mar 3 · 6/10

Solving the Granularity Mismatch: Hierarchical Preference Learning for Long-Horizon LLM Agents

Researchers introduce Hierarchical Preference Learning (HPL), a new framework that improves AI agent training by using preference signals at multiple granularities - trajectory, group, and step levels. The method addresses limitations in existing Direct Preference Optimization approaches and demonstrates superior performance on challenging agent benchmarks through a dual-layer curriculum learning system.
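The multi-granularity idea can be sketched as one Bradley-Terry preference loss per level, mixed with level weights. The scores and weights are invented, and HPL's actual objective builds on DPO with a curriculum rather than this plain logistic form.

```python
import math

# Hedged sketch of mixing preference losses at trajectory, group,
# and step granularity. All numbers below are illustrative.

def bt_loss(preferred_score, rejected_score):
    """Standard Bradley-Terry negative log-likelihood of preferring
    the winner over the loser."""
    return -math.log(1.0 / (1.0 + math.exp(rejected_score - preferred_score)))

def hierarchical_loss(pairs_by_level, weights):
    """pairs_by_level maps a level name to (winner, loser) score pairs;
    each level contributes its mean loss, scaled by a level weight."""
    total = 0.0
    for level, pairs in pairs_by_level.items():
        level_loss = sum(bt_loss(w, l) for w, l in pairs) / len(pairs)
        total += weights[level] * level_loss
    return total

loss = hierarchical_loss(
    {"trajectory": [(2.0, 0.5)],   # clear winner: small loss
     "group":      [(1.0, 1.0)],   # tie: loss is ln 2
     "step":       [(0.3, -0.2)]}, # weak step-level preference
    weights={"trajectory": 0.5, "group": 0.3, "step": 0.2},
)
```

Coarse trajectory pairs give a stable but blunt signal, while step pairs localize credit; weighting the levels is what lets training trade the two off.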

AI · Bullish · arXiv – CS AI · Mar 2 · 7/10

PseudoAct: Leveraging Pseudocode Synthesis for Flexible Planning and Action Control in Large Language Model Agents

Researchers introduce PseudoAct, a new framework that uses pseudocode synthesis to improve large language model agent planning and action control. The method achieves significant performance improvements over existing reactive approaches, with a 20.93% absolute gain in success rate on FEVER benchmark and new state-of-the-art results on HotpotQA.
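Plan-as-pseudocode can be illustrated with a tiny interpreter that executes synthesized steps against a lookup table standing in for retrieval. The mini-language (SEARCH / IF_MISSING / ANSWER) is entirely invented and is not PseudoAct's actual representation.

```python
# Toy interpreter for a pseudocode plan: the plan is synthesized
# once, then executed step by step, with branching letting it
# react to intermediate results. The op names are invented.

def run_plan(plan, knowledge):
    """Execute pseudocode lines; `knowledge` stands in for a
    retrieval backend."""
    found = None
    for line in plan:
        op, _, arg = line.partition(" ")
        if op == "SEARCH":
            found = knowledge.get(arg)
        elif op == "IF_MISSING" and found is None:
            return "abstain"                 # plan-level error handling
        elif op == "ANSWER":
            return found
    return found

plan = ["SEARCH capital_of_france", "IF_MISSING abstain", "ANSWER"]
result = run_plan(plan, {"capital_of_france": "Paris"})
```

Unlike a purely reactive agent, the same fixed plan handles both the success and the retrieval-failure path, which is the flexibility the pseudocode representation buys.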

AI · Bullish · arXiv – CS AI · Mar 2 · 7/10

Rudder: Steering Prefetching in Distributed GNN Training using LLM Agents

Researchers introduced Rudder, a software module that uses Large Language Models (LLMs) to optimize data prefetching in distributed Graph Neural Network training. The system shows up to 91% performance improvement over baseline training and 82% over static prefetching by autonomously adapting to dynamic conditions.
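The control loop behind adaptive prefetching can be sketched with two runtime signals: trainer stalls (fetch ahead more) and idle buffers (back off). Rudder delegates this decision to an LLM; the threshold rule below is a simple hand-written stand-in, and all names are assumptions.

```python
# Toy adaptive prefetch controller: widen the prefetch window when
# the trainer stalls waiting on graph data, narrow it when
# prefetched batches sit unused. Thresholds are illustrative.

def adjust_depth(depth, stall_rate, idle_buffer_ratio,
                 min_depth=1, max_depth=16):
    """Return a new prefetch depth from two observed signals."""
    if stall_rate > 0.1:             # trainer waits on data often
        depth = min(max_depth, depth * 2)
    elif idle_buffer_ratio > 0.5:    # most prefetched batches unused
        depth = max(min_depth, depth // 2)
    return depth                     # otherwise hold steady

d = adjust_depth(4, stall_rate=0.3, idle_buffer_ratio=0.1)
```

A static policy must pick one depth for all conditions; re-running a rule like this each epoch is the minimal version of the dynamic adaptation the paper attributes its gains to.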

AI · Bullish · arXiv – CS AI · Feb 27 · 6/10

Requesting Expert Reasoning: Augmenting LLM Agents with Learned Collaborative Intervention

Researchers introduce AHCE (Active Human-Augmented Challenge Engagement), a framework that enables AI agents to collaborate with human experts more effectively through learned policies. The system achieved 32% improvement on normal difficulty tasks and 70% on difficult tasks in Minecraft experiments by treating humans as interactive reasoning tools rather than simple help sources.

AI · Bullish · arXiv – CS AI · Feb 27 · 6/10

Exploratory Memory-Augmented LLM Agent via Hybrid On- and Off-Policy Optimization

Researchers propose EMPO², a new hybrid reinforcement learning framework that improves exploration capabilities for large language model agents by combining memory augmentation with on- and off-policy optimization. The framework achieves significant performance improvements of 128.6% on ScienceWorld and 11.3% on WebShop compared to existing methods, while demonstrating superior adaptability to new tasks without requiring parameter updates.

AI · Neutral · arXiv – CS AI · Mar 17 · 5/10

Schema First Tool APIs for LLM Agents: A Controlled Study of Tool Misuse, Recovery, and Budgeted Performance

A research study examined how different tool interface designs affect LLM agent performance under strict interaction budgets. While schema-based interfaces reduced contract violations, they didn't improve overall task success or semantic understanding, suggesting that formal tool specifications alone aren't sufficient for reliable AI agent operation.

AI · Neutral · arXiv – CS AI · Mar 5 · 4/10

On the Suitability of LLM-Driven Agents for Dark Pattern Audits

Researchers evaluated LLM-driven agents' ability to identify dark patterns in web interfaces, specifically testing on 456 data broker websites processing CCPA data rights requests. The study examined whether AI agents can reliably detect manipulative design elements that discourage users from exercising their privacy rights.

โ† PrevPage 3 of 3