AI · Bullish · arXiv – CS AI · May 28 · 7/10
🧠Researchers propose COSE, a self-evolution framework for large language models that uses confidence signals to filter noisy self-generated training feedback without external verifiers. The method demonstrates consistent improvements across 19 benchmarks and multiple model sizes (0.6B–4B parameters), achieving state-of-the-art performance in reasoning and mathematics tasks.
🧠 Llama
AI · Bullish · arXiv – CS AI · May 12 · 7/10
🧠Researchers introduce G-Zero, a verifier-free framework that enables large language models to improve autonomously through self-play without relying on external judges or proxy models. The approach uses an intrinsic reward mechanism called Hint-δ to identify and address the Generator model's blind spots, achieving scalable self-evolution across unverifiable domains.
AI · Bullish · arXiv – CS AI · May 9 · 7/10
🧠Researchers introduce SkillOS, a reinforcement learning framework that enables LLM-based agents to autonomously curate and evolve reusable skills from experience rather than relying on manual intervention. The system pairs a frozen agent executor with a trainable skill curator that manages an external skill repository, demonstrating consistent improvements in effectiveness and efficiency across multi-turn and single-turn tasks while generalizing across different agent architectures.
AI · Bullish · arXiv – CS AI · Apr 7 · 7/10
🧠Researchers propose Continuous Softened Retracing reSampling (CSRS) to improve the self-evolution of Multimodal Large Language Models by addressing biases in feedback mechanisms. The method uses continuous reward signals instead of binary rewards and achieves state-of-the-art results on mathematical reasoning benchmarks like MathVision using Qwen2.5-VL-7B.
AI · Bullish · arXiv – CS AI · Mar 17 · 7/10
🧠Researchers introduce SAGE, a multi-agent framework that improves large language model reasoning through self-evolution using four specialized agents. The system achieves significant performance gains on coding and mathematics benchmarks without requiring large human-labeled datasets.
AI · Bullish · arXiv – CS AI · Mar 4 · 7/10 · 3
🧠Researchers propose a framework for sustainable AI self-evolution through triadic roles (Proposer, Solver, Verifier) that ensures learnable information gain across iterations. The study identifies three key system designs to prevent the common plateau effect in self-play AI systems: asymmetric co-evolution, capacity growth, and proactive information seeking.
AI · Neutral · arXiv – CS AI · 4d ago · 6/10
🧠Researchers present findings in an arXiv preprint that challenge assumptions about LLM agent capabilities, revealing that a model's base performance doesn't predict its ability to self-evolve through harness updates. The study identifies two distinct capabilities, harness-updating and harness-benefit, with counterintuitive results suggesting that mid-tier models benefit most from self-evolution while strong models benefit less.
🧠 Claude
AI · Neutral · arXiv – CS AI · May 29 · 6/10
🧠Researchers introduce BenchTrace, a benchmark framework for evaluating how well large language model agents learn from failures through reflection and self-evolution. Testing on Qwen3-32B and GPT-4.1 reveals significant limitations: both models achieve below 30% accuracy on reflection tasks, struggle with diagnosis, and experience performance degradation as noise accumulates in their learning processes.
🧠 GPT-4
AI · Neutral · arXiv – CS AI · May 29 · 6/10
🧠Researchers introduce PTCG-Bench, a benchmark using the Pokémon Trading Card Game to evaluate how well large language model agents can master complex strategic games and improve through self-experience. The study reveals that while LLM agents demonstrate competent gameplay, they struggle with sustained self-evolution and are heavily influenced by system design choices.
AI · Neutral · arXiv – CS AI · May 12 · 6/10
🧠Researchers identify capability erosion in self-evolving LLM agents, where systems adapting to new tasks progressively lose previously learned abilities across workflow, skill, model, and memory dimensions. The study proposes Capability-Preserving Evolution (CPE), a stabilization framework that maintains performance on existing tasks while enabling new adaptations, and reports more stable retention of existing capabilities across all evolution channels.
🧠 GPT-5
AI · Neutral · arXiv – CS AI · May 12 · 6/10
🧠MAGE is a framework for self-evolving language model agents that uses co-evolutionary knowledge graphs to preserve learned knowledge across iterations without modifying the base model. The system externalizes learning into structured memory subgraphs, enabling frozen backbone models to improve through retrieved guidance while maintaining inference stability across nine diverse benchmarks.
AI · Bullish · arXiv – CS AI · May 12 · 6/10
🧠EmbodiSkill is a training-free framework that enables embodied AI agents to autonomously improve their skills through reflection on task execution trajectories. By distinguishing between skill deficiencies and execution lapses, the system allows frozen language models to achieve significantly higher task success rates, with a Qwen 3.5-27B model reaching 93.28% success on the ALFWorld benchmark.
🧠 GPT-5
AI · Neutral · arXiv – CS AI · Mar 16 · 4/10
🧠Researchers introduce Steve-Evolving, a new AI framework for open-world embodied agents that uses fine-grained diagnosis and knowledge distillation to improve long-horizon task performance. The system organizes interaction experiences into structured tuples and continuously evolves without model parameter updates, showing improvements in Minecraft testing environments.