y0news
AnalyticsDigestsSourcesTopicsRSSAICrypto

#agent-training News & Analysis

20 articles tagged with #agent-training. AI-curated summaries with sentiment analysis and key takeaways from 50+ sources.

20 articles
AIBullisharXiv – CS AI · May 287/10
🧠

Plan Before Search: Search Agents Need Plan

Researchers demonstrate that large language models trained as retrieval-augmented agents benefit from explicit planning—decomposing questions into ordered sub-questions before searching—rather than reactive document-driven responses. They introduce a self-bootstrapping training paradigm that enables smaller seed models to generate filtered trajectories activating this planning behavior across different model sizes without requiring distillation from larger external models.

AIBullisharXiv – CS AI · May 277/10
🧠

GUI-Libra: Training Native GUI Agents to Reason and Act with Action-aware Supervision and Partially Verifiable RL

GUI-Libra presents a specialized training methodology for native GUI agents that addresses critical gaps between open-source and closed-source systems through action-aware supervised fine-tuning and improved reinforcement learning with partial verifiability. The work introduces an 81K curated GUI reasoning dataset and demonstrates consistent improvements across web and mobile benchmarks without requiring expensive online data collection.

AINeutralarXiv – CS AI · May 127/10
🧠

SkillMaster: Toward Autonomous Skill Mastery in LLM Agents

Researchers introduce SkillMaster, a training framework that enables LLM agents to autonomously create, refine, and select skills during task execution rather than relying on external supervision. The system demonstrates 8.8-9.3% performance improvements over existing baselines on complex agent benchmarks, representing a significant step toward self-improving AI agents.

AIBullisharXiv – CS AI · May 127/10
🧠

Evidence Over Plans: Online Trajectory Verification for Skill Distillation

Researchers introduce SPARK, a framework that verifies AI agent skills through direct environment interaction rather than relying on pre-written plans. The Posterior Distillation Index (PDI) metric ensures skills are grounded in actual task evidence, producing student models that match or exceed human-written skills while reducing inference costs by up to 1,000x.

AIBullisharXiv – CS AI · 3d ago6/10
🧠

Self-Evolving Deep Research via Joint Generation and Evaluation

Researchers introduce SCORE, a self-evolving co-evolutionary framework that jointly trains evaluation and generation models for deep research report generation. The approach addresses limitations in LLM-based research agents by enabling evaluators to dynamically adapt standards as solver performance improves, demonstrating consistent quality improvements over static evaluation methods.

AIBullisharXiv – CS AI · 5d ago6/10
🧠

HomeFlow: A Data Flywheel for Smart Home Agent Training with Verifiable Simulation

HomeFlow introduces a data flywheel system for training large language model agents in smart home environments, using procedural generation and Monte Carlo tree search to create diverse, verifiable training trajectories. The approach achieves 87.03% task success rates on a new SmartHome-Bench benchmark, outperforming GPT-5.5 by 1.23 percentage points.

🧠 GPT-5
AIBullisharXiv – CS AI · 5d ago6/10
🧠

SIRI: Self-Internalizing Reinforcement Learning with Intrinsic Skills for LLM Agent Training

Researchers introduce SIRI, a three-phase reinforcement learning framework that enables LLM agents to autonomously discover, validate, and internalize reusable skills without external skill generators or inference-time skill banks. Testing on ALFWorld and WebShop benchmarks shows meaningful performance improvements over baseline methods while reducing deployment complexity and latency.

AINeutralarXiv – CS AI · 6d ago6/10
🧠

Harness Updating Is Not Harness Benefit: Disentangling Evolution Capabilities in Self-Evolving LLM Agents

Researchers at arXiv present findings that challenge assumptions about LLM agent capabilities, revealing that a model's base performance doesn't predict its ability to self-evolve through harness updates. The study identifies two distinct capabilities—harness-updating and harness-benefit—with counterintuitive results suggesting mid-tier models benefit most from self-evolution while strong models benefit less.

🧠 Claude
AINeutralarXiv – CS AI · 6d ago6/10
🧠

Skill Reuse as Compression in Agentic RL

Researchers introduce ReuseRL, a reinforcement learning framework that improves LLM agent generalization by encouraging skill reuse and compression. By grounding agentic RL in the Minimum Description Length principle and penalizing task-specific shortcuts, the method demonstrates better in- and out-of-distribution performance across multiple benchmark environments.

AINeutralarXiv – CS AI · May 296/10
🧠

Self-Play Reinforcement Learning under Imperfect Information in Big 2

Researchers develop a self-play reinforcement learning framework for Big 2, a four-player imperfect-information card game, demonstrating that PPO outperforms value-based methods under controlled conditions. The study reveals that entropy regularization and current-policy self-play improve agent performance, establishing Big 2 as a useful benchmark for testing deep RL in complex multi-agent environments with hidden information and variable action spaces.

AINeutralarXiv – CS AI · May 276/10
🧠

StepOPSD: Step-Aware Online Preference Distillation for Agent Reinforcement Learning

StepOPSD introduces a novel reinforcement learning framework that improves credit assignment in multi-turn agent tasks by treating individual steps rather than entire trajectories as the unit of learning. The method achieves state-of-the-art results on benchmark tasks like ALFWorld and Search-QA, demonstrating that step-level preference distillation is particularly effective when trajectory rewards poorly correlate with individual decision quality.

AIBullisharXiv – CS AI · May 126/10
🧠

EmbodiSkill: Skill-Aware Reflection for Self-Evolving Embodied Agents

EmbodiSkill introduces a training-free framework enabling embodied AI agents to autonomously improve their skills through reflection on task execution trajectories. By distinguishing between skill deficiencies and execution lapses, the system allows frozen language models to achieve significantly higher task success rates, with a Qwen 3.5-27B model reaching 93.28% success on ALFWorld benchmarks.

🧠 GPT-5
AINeutralarXiv – CS AI · May 126/10
🧠

How Mobile World Model Guides GUI Agents?

Researchers developed and evaluated mobile world models across four modalities (delta text, full text, diffusion images, and renderable code) to guide GUI agents in executing smartphone tasks. The study reveals that renderable code provides the best in-distribution fidelity while text-based models are more robust for out-of-distribution execution, and that world-model-generated trajectories can improve agent training despite not preserving original data distributions.

AINeutralarXiv – CS AI · May 116/10
🧠

EnvSimBench: A Benchmark for Evaluating and Improving LLM-Based Environment Simulation

Researchers introduce EnvSimBench, a benchmark for evaluating how well large language models can simulate interactive environments for AI agent training. The study reveals a critical flaw: LLMs achieve near-perfect accuracy when environment state remains static but fail catastrophically when multiple simultaneous state changes occur, exposing a fundamental capability gap in LLM-based simulation.

AINeutralarXiv – CS AI · May 116/10
🧠

Signal Reshaping for GRPO in Weak-Feedback Agentic Code Repair

Researchers present a signal-reshaping framework for GRPO (Group Relative Policy Optimization) that improves code-agent reinforcement learning under weak feedback conditions. The approach combines layered rewards, process-level credit assignment, and execution-aware rollout governance to increase strict compile-and-semantic accuracy from 38.5% to 53.5% on agentic code repair tasks.

AINeutralarXiv – CS AI · May 116/10
🧠

Learning CLI Agents with Structured Action Credit under Selective Observation

Researchers present a new approach to training CLI agents through reinforcement learning, introducing σ-Reveal for selective observation and A³ for credit assignment. The work addresses fundamental challenges in teaching AI systems to interact with command-line interfaces by leveraging structured action properties and proposing the ShellOps dataset for evaluation.

AIBullisharXiv – CS AI · May 116/10
🧠

Scalable Option Learning in High-Throughput Environments

Facebook Research introduces Scalable Option Learning (SOL), a hierarchical reinforcement learning algorithm that achieves 35x higher throughput than existing methods. The system was validated on complex environments including NetHack using 30 billion frames of experience, demonstrating superior performance over flat agents and suggesting that hierarchical RL can finally benefit from large-scale training.

$SOL
AIBullisharXiv – CS AI · Apr 206/10
🧠

EnvScaler: Scaling Tool-Interactive Environments for LLM Agent via Programmatic Synthesis

EnvScaler is an automated framework that generates synthetic tool-interaction environments for training LLM agents through programmatic synthesis, creating 191 diverse environments and 7,000 scenarios. The approach addresses scalability challenges in LLM agent training by combining topic mining and logic modeling to overcome hallucinations and manual bottlenecks, demonstrating improved performance on multi-turn, multi-tool interaction tasks.

AIBullisharXiv – CS AI · Apr 136/10
🧠

E3-TIR: Enhanced Experience Exploitation for Tool-Integrated Reasoning

Researchers introduce E3-TIR, a new training paradigm for Large Language Models that improves tool-use reasoning by combining expert guidance with self-exploration. The method achieves 6% performance gains while using less than 10% of typical synthetic data, addressing key limitations in current reinforcement learning approaches for AI agents.

AINeutralarXiv – CS AI · Feb 275/105
🧠

CWM: Contrastive World Models for Action Feasibility Learning in Embodied Agent Pipelines

Researchers propose Contrastive World Models (CWM), a new approach for training AI agents to better distinguish between physically feasible and infeasible actions in embodied environments. The method uses contrastive learning with hard negative examples to outperform traditional supervised fine-tuning, achieving 6.76 percentage point improvement in precision and better safety margins under stress conditions.