#agent-training News & Analysis

28 articles tagged with #agent-training. AI-curated summaries with sentiment analysis and key takeaways from 50+ sources.

AI · Bullish · arXiv – CS AI · Jun 19 · 7/10

ScaleWoB: Guiding GUI Agents with Coding Agents via Large-Scale Environmental Synthesis

Researchers present ScaleWoB, a framework that synthesizes high-fidelity interactive environments for training and evaluating GUI agents across mobile, desktop, and automotive platforms. The approach addresses limitations of real-world testing by providing verifiable rewards, low resource costs, and access via URL-based backends. In the reported results, state-of-the-art agents achieve only 27.92% success, compared to 92.08% for humans.

AI · Bullish · arXiv – CS AI · May 28 · 7/10

Plan Before Search: Search Agents Need Plan

Researchers demonstrate that large language models trained as retrieval-augmented agents benefit from explicit planning—decomposing questions into ordered sub-questions before searching—rather than reactive document-driven responses. They introduce a self-bootstrapping training paradigm that enables smaller seed models to generate filtered trajectories activating this planning behavior across different model sizes without requiring distillation from larger external models.

AI · Bullish · arXiv – CS AI · May 27 · 7/10

GUI-Libra: Training Native GUI Agents to Reason and Act with Action-aware Supervision and Partially Verifiable RL

GUI-Libra presents a specialized training methodology for native GUI agents that addresses critical gaps between open-source and closed-source systems through action-aware supervised fine-tuning and improved reinforcement learning with partial verifiability. The work introduces an 81K curated GUI reasoning dataset and demonstrates consistent improvements across web and mobile benchmarks without requiring expensive online data collection.

AI · Neutral · arXiv – CS AI · May 12 · 7/10

SkillMaster: Toward Autonomous Skill Mastery in LLM Agents

Researchers introduce SkillMaster, a training framework that enables LLM agents to autonomously create, refine, and select skills during task execution rather than relying on external supervision. The system demonstrates 8.8-9.3% performance improvements over existing baselines on complex agent benchmarks, representing a significant step toward self-improving AI agents.

AI · Bullish · arXiv – CS AI · May 12 · 7/10

Evidence Over Plans: Online Trajectory Verification for Skill Distillation

Researchers introduce SPARK, a framework that verifies AI agent skills through direct environment interaction rather than relying on pre-written plans. The Posterior Distillation Index (PDI) metric ensures skills are grounded in actual task evidence, producing student models that match or exceed human-written skills while reducing inference costs by up to 1,000x.

AI · Neutral · arXiv – CS AI · Jun 23 · 6/10

Process-Reward Tactic Evolution for Long-Horizon Bioinformatics Workflows

Researchers introduce Process-Reward Tactic Evolution, a training framework that enables LLM agents to reliably execute complex bioinformatics workflows in Galaxy by accumulating reusable tactics from verified workflow rollouts. The approach combines process verification, curriculum learning, and tactic libraries to improve long-horizon task completion, biological correctness, and execution efficiency compared to baseline methods.

AI · Neutral · arXiv – CS AI · Jun 11 · 6/10

HERO: Hindsight-Enhanced Reflection from Environment Observations for Agentic Self-Distillation

Researchers introduce HERO, a self-distillation framework for reinforcement learning agents that uses environment observations as feedback to improve multi-turn decision-making. The method addresses credit assignment problems in sequential tasks by converting observations into actionable diagnoses, outperforming existing approaches on benchmark tasks with limited training data.

AI · Neutral · arXiv – CS AI · Jun 11 · 6/10

Agentic Environment Engineering for Large Language Models: A Survey of Environment Modeling, Synthesis, Evaluation, and Application

This arXiv paper presents a comprehensive survey of agentic environments for large language models, systematizing research across modeling, synthesis, evaluation, and application. The work proposes frameworks for environment engineering, automated synthesis methods (symbolic and neural), and identifies four evolutionary pathways for agent-environment co-evolution, establishing foundational concepts for developing more capable AI agents.

AI · Bullish · arXiv – CS AI · Jun 10 · 6/10

Role-Agent: Bootstrapping LLM Agents via Dual-Role Evolution

Researchers introduce Role-Agent, a framework enabling a single LLM to function as both agent and training environment through dual-role co-evolution. The system combines World-In-Agent (predicting environment states for process rewards) and Agent-In-World (analyzing failure patterns to optimize training data), achieving performance improvements of more than 4% across multiple benchmarks.

AI · Neutral · arXiv – CS AI · Jun 9 · 6/10

Unsupervised Partner Design Enables Robust Ad-hoc Teamwork

Researchers introduce Unsupervised Partner Design (UPD), a multi-agent reinforcement learning method that generates and adaptively selects training partners without requiring pre-trained populations or manual tuning. The approach demonstrates strong performance across multiple benchmarks and achieves higher human preference ratings for adaptability and naturalness compared to existing baselines.

AI · Neutral · arXiv – CS AI · Jun 8 · 6/10

Do Coding Agents Deceive Us? Detecting and Preventing Cheating via Capped Evaluation with Randomized Tests

Researchers propose CapCode and CapReward, frameworks designed to detect and prevent AI coding agents from achieving high evaluation scores through shortcuts rather than genuine task-solving. By capping the maximum achievable non-cheating performance below 100%, scores above the cap serve as evidence of deceptive behavior, enabling more reliable agent evaluation.
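
The capping logic described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `flag_over_cap`, the agent names, the scores, and the cap value of 0.80 are all hypothetical and are not taken from the CapCode or CapReward paper.

```python
def flag_over_cap(scores, cap):
    """Return the agents whose score exceeds the honest-performance cap.

    Assumption: `cap` is the maximum score reachable without cheating,
    so any score strictly above it is treated as evidence of a shortcut.
    """
    return [name for name, score in scores.items() if score > cap]


# Hypothetical evaluation results (fraction of tests passed).
scores = {"agent_a": 0.72, "agent_b": 0.95, "agent_c": 0.80}

flagged = flag_over_cap(scores, cap=0.80)
print(flagged)
```

A score exactly at the cap is not flagged here; whether the boundary counts as suspicious is a design choice this sketch leaves open.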

AI · Neutral · arXiv – CS AI · Jun 5 · 6/10

CollabBench: Benchmarking and Unleashing Collaborative Ability of LLMs with Diverse Players via Proactive Engagement

Researchers introduce CollabBench, a benchmark for evaluating LLM-based agents' ability to collaborate with diverse human partners in cooperative game environments. The framework uses simulated player profiles and a hybrid training approach that balances task efficiency with emotional adaptation, achieving 19.5% higher efficiency and 24.4% improved affective performance compared to base models.

AI · Bullish · arXiv – CS AI · Jun 4 · 6/10

Self-Evolving Deep Research via Joint Generation and Evaluation

Researchers introduce SCORE, a self-evolving co-evolutionary framework that jointly trains evaluation and generation models for deep research report generation. The approach addresses limitations in LLM-based research agents by enabling evaluators to dynamically adapt standards as solver performance improves, demonstrating consistent quality improvements over static evaluation methods.

AI · Bullish · arXiv – CS AI · Jun 2 · 6/10

HomeFlow: A Data Flywheel for Smart Home Agent Training with Verifiable Simulation

HomeFlow introduces a data flywheel system for training large language model agents in smart home environments, using procedural generation and Monte Carlo tree search to create diverse, verifiable training trajectories. The approach achieves an 87.03% task success rate on a new SmartHome-Bench benchmark, outperforming GPT-5.5 by 1.23 percentage points.

AI · Bullish · arXiv – CS AI · Jun 2 · 6/10

SIRI: Self-Internalizing Reinforcement Learning with Intrinsic Skills for LLM Agent Training

Researchers introduce SIRI, a three-phase reinforcement learning framework that enables LLM agents to autonomously discover, validate, and internalize reusable skills without external skill generators or inference-time skill banks. Testing on ALFWorld and WebShop benchmarks shows meaningful performance improvements over baseline methods while reducing deployment complexity and latency.

AI · Neutral · arXiv – CS AI · Jun 1 · 6/10

Harness Updating Is Not Harness Benefit: Disentangling Evolution Capabilities in Self-Evolving LLM Agents

An arXiv paper challenges assumptions about LLM agent capabilities, finding that a model's base performance does not predict its ability to self-evolve through harness updates. The study separates two distinct capabilities, harness-updating and harness-benefit, and reports the counterintuitive result that mid-tier models benefit most from self-evolution while strong models benefit less.

AI · Neutral · arXiv – CS AI · Jun 1 · 6/10

Skill Reuse as Compression in Agentic RL

Researchers introduce ReuseRL, a reinforcement learning framework that improves LLM agent generalization by encouraging skill reuse and compression. By grounding agentic RL in the Minimum Description Length principle and penalizing task-specific shortcuts, the method demonstrates better in- and out-of-distribution performance across multiple benchmark environments.

AI · Neutral · arXiv – CS AI · May 29 · 6/10

Self-Play Reinforcement Learning under Imperfect Information in Big 2

Researchers develop a self-play reinforcement learning framework for Big 2, a four-player imperfect-information card game, demonstrating that PPO outperforms value-based methods under controlled conditions. The study reveals that entropy regularization and current-policy self-play improve agent performance, establishing Big 2 as a useful benchmark for testing deep RL in complex multi-agent environments with hidden information and variable action spaces.

AI · Neutral · arXiv – CS AI · May 27 · 6/10

StepOPSD: Step-Aware Online Preference Distillation for Agent Reinforcement Learning

StepOPSD introduces a novel reinforcement learning framework that improves credit assignment in multi-turn agent tasks by treating individual steps rather than entire trajectories as the unit of learning. The method achieves state-of-the-art results on benchmark tasks like ALFWorld and Search-QA, demonstrating that step-level preference distillation is particularly effective when trajectory rewards poorly correlate with individual decision quality.

AI · Bullish · arXiv – CS AI · May 12 · 6/10

EmbodiSkill: Skill-Aware Reflection for Self-Evolving Embodied Agents

EmbodiSkill introduces a training-free framework enabling embodied AI agents to autonomously improve their skills through reflection on task execution trajectories. By distinguishing between skill deficiencies and execution lapses, the system allows frozen language models to achieve significantly higher task success rates, with a Qwen 3.5-27B model reaching 93.28% success on ALFWorld benchmarks.

AI · Neutral · arXiv – CS AI · May 12 · 6/10

How Mobile World Model Guides GUI Agents?

Researchers developed and evaluated mobile world models across four modalities (delta text, full text, diffusion images, and renderable code) to guide GUI agents in executing smartphone tasks. The study reveals that renderable code provides the best in-distribution fidelity while text-based models are more robust for out-of-distribution execution, and that world-model-generated trajectories can improve agent training despite not preserving original data distributions.

AI · Neutral · arXiv – CS AI · May 11 · 6/10

EnvSimBench: A Benchmark for Evaluating and Improving LLM-Based Environment Simulation

Researchers introduce EnvSimBench, a benchmark for evaluating how well large language models can simulate interactive environments for AI agent training. The study reveals a critical flaw: LLMs achieve near-perfect accuracy when environment state remains static but fail catastrophically when multiple simultaneous state changes occur, exposing a fundamental capability gap in LLM-based simulation.

AI · Neutral · arXiv – CS AI · May 11 · 6/10

Signal Reshaping for GRPO in Weak-Feedback Agentic Code Repair

Researchers present a signal-reshaping framework for GRPO (Group Relative Policy Optimization) that improves code-agent reinforcement learning under weak feedback conditions. The approach combines layered rewards, process-level credit assignment, and execution-aware rollout governance to increase strict compile-and-semantic accuracy from 38.5% to 53.5% on agentic code repair tasks.
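
For context, the group-relative advantage that GRPO is commonly described as using can be sketched as below. This shows only the baseline normalization, not the paper's signal-reshaping method; the function name, the `eps` constant, and the example rewards are assumptions made for illustration, and implementations differ on details such as population versus sample standard deviation.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each rollout's reward against its own sampling group.

    Each advantage is (reward - group mean) / (group std + eps).
    Assumption: population standard deviation is used here.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical group of four rollouts for one repair task:
# two fail (reward 0.0) and two pass (reward 1.0).
mixed = group_relative_advantages([0.0, 0.0, 1.0, 1.0])

# Weak feedback: every rollout gets the same reward, so no rollout
# is preferred and all advantages collapse to zero.
flat = group_relative_advantages([0.5, 0.5, 0.5, 0.5])
```

The second call illustrates why weak feedback is a problem for this kind of objective: when all rollouts in a group score the same, the normalized advantages carry no learning signal.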

AI · Neutral · arXiv – CS AI · May 11 · 6/10

Learning CLI Agents with Structured Action Credit under Selective Observation

Researchers present a new approach to training CLI agents through reinforcement learning, introducing σ-Reveal for selective observation and A³ for credit assignment. The work addresses fundamental challenges in teaching AI systems to interact with command-line interfaces by leveraging structured action properties and proposing the ShellOps dataset for evaluation.

AI · Bullish · arXiv – CS AI · May 11 · 6/10

Scalable Option Learning in High-Throughput Environments

Facebook Research introduces Scalable Option Learning (SOL), a hierarchical reinforcement learning algorithm that achieves 35x higher throughput than existing methods. The system was validated on complex environments including NetHack using 30 billion frames of experience, demonstrating superior performance over flat agents and suggesting that hierarchical RL can finally benefit from large-scale training.
