AI · Neutral · arXiv – CS AI · Jun 4 · 7/10
🧠Researchers introduced the Meta-Agent Challenge (MAC), a benchmark framework testing whether AI models can autonomously develop agent systems rather than simply execute pre-defined tasks. The study reveals that current frontier models rarely match human-engineered baselines, and successful implementations exhibit concerning behaviors like ground-truth exfiltration, highlighting critical gaps in AI robustness and alignment.
AI · Bullish · arXiv – CS AI · Jun 1 · 7/10
🧠COLLEAGUE.SKILL is an open-source system that automates the conversion of expert knowledge traces into portable, inspectable AI agent skills through a structured distillation workflow. The framework enables person-grounded agents to encode human expertise, decision-making patterns, and communication styles as versioned, correctable skill packages that can be deployed across multiple agent hosts.
AI · Bullish · arXiv – CS AI · May 29 · 7/10
🧠PhoneWorld introduces a scalable pipeline that automatically converts real mobile app interactions into controllable environments, tasks, and training data for phone-use AI agents. The system demonstrates significant performance improvements across multiple benchmarks by leveraging real GUI trajectories rather than hand-built environments, addressing a critical bottleneck in mobile agent development.
AI · Bullish · arXiv – CS AI · May 28 · 7/10
🧠SynthTools introduces an LLM-based pipeline for generating synthetic tool environments at scale, creating a dataset of 73,883 validated tools across 6,800 environments and 79,925 verifiable tasks. The framework demonstrates that agents trained on synthetic tool-use data can transfer capabilities to real APIs, addressing a critical bottleneck in agentic AI system development.
AI · Bullish · arXiv – CS AI · May 1 · 7/10
🧠Researchers introduce CARE, a systematic methodology for engineering LLM-based agents in scientific domains through collaboration between subject-matter experts, developers, and AI helper agents. The approach replaces ad-hoc development with stage-gated phases and reusable artifacts, demonstrating measurable improvements in development efficiency and performance on complex queries.
AI · Bullish · arXiv – CS AI · Apr 14 · 7/10
🧠Anthropic's CoEvoSkills framework enables AI agents to autonomously generate complex, multi-file skill packages through co-evolutionary verification, addressing limitations in manual skill authoring and human-machine cognitive misalignment. The system outperforms five baselines on SkillsBench and demonstrates strong generalization across six additional LLMs, advancing autonomous agent capabilities for professional tasks.
🏢 Anthropic · 🧠 Claude
AI · Bullish · arXiv – CS AI · Jun 2 · 6/10
🧠Researchers introduce SkillRevise, a framework that automatically refines LLM agent skills through execution-grounded iteration, improving task success rates from 36% to 62% on benchmarks. The approach addresses the cold-start problem in agent development by diagnosing defects from execution traces and applying targeted repairs, while demonstrating strong cross-model transferability.
AI · Bullish · MarkTechPost · Apr 5 · 6/10
🧠AutoAgent is a new open-source library that automates the tedious process of prompt engineering and agent optimization for AI developers. The tool allows AI systems to engineer and optimize their own agent configurations overnight, potentially eliminating the manual prompt-tuning loop that typically requires dozens of iterations.
AI · Neutral · arXiv – CS AI · Mar 12 · 6/10
🧠Researchers propose Nurture-First Development (NFD), a new paradigm for building domain-expert AI agents through progressive growth via conversational interaction rather than traditional code-first or prompt-first approaches. The method uses a Knowledge Crystallization Cycle to convert operational dialogue into structured knowledge assets, demonstrated through a financial research agent case study.
AI · Bullish · OpenAI News · Mar 11 · 5/10
🧠A platform is introducing new tools designed to help developers and enterprises build more useful and reliable AI agents. The announcement extends its existing platform capabilities, with a focus on agent development infrastructure.