AI · Bullish · arXiv – CS AI · Jun 9 · 7/10
🧠Researchers introduce Anything2Skill, a framework that converts external knowledge sources into reusable, executable skills for AI agents. By combining skill extraction with retrieval-augmented generation, the system achieves 98.85% success on command-line tasks and 94.10% on GitHub operations, significantly outperforming RAG-only approaches.
AI · Bullish · arXiv – CS AI · Jun 2 · 7/10
🧠Researchers have developed an AI framework that transforms materials synthesis procedures from unstructured narrative text into actionable, computable knowledge using large language models and structured databases. The system successfully optimized boron nitride nanosheet synthesis in three iterations, demonstrating AI's potential to accelerate complex materials discovery beyond traditional trial-and-error approaches.
AI · Bullish · arXiv – CS AI · May 29 · 7/10
🧠Researchers introduce GRASP, a method for improving large language model agents through controlled skill library updates that prevent performance regression. Tested across five base models on clinical benchmarks, GRASP raises MedAgentBench results from 40.6% to 88.8% while maintaining stability, outperforming existing self-improvement approaches.
🧠 GPT-4 · 🧠 GPT-5 · 🧠 Gemini
AI · Bullish · arXiv – CS AI · May 12 · 7/10
🧠Researchers introduce MIND-Skill, an automated framework that generates reusable skills for LLM-powered AI agents by analyzing successful task trajectories. The system uses dual agents with quality-control mechanisms to create generalizable, documented procedures that enable autonomous systems to handle complex, multi-step problems without manual human expertise.
AI · Neutral · arXiv – CS AI · Jun 11 · 6/10
🧠Researchers introduce SkillJuror, a framework measuring how LLM agent skill organization affects runtime behavior independent of content. Testing Progressive Disclosure (a hierarchical skill structure) against flat baselines shows that agents access 3.26x more resources and achieve 4.1% higher verification rates, indicating that the way procedural knowledge is presented meaningfully influences agent reasoning patterns.
AI · Bullish · arXiv – CS AI · Jun 8 · 6/10
🧠Researchers introduce W2S, a framework for automatically constructing high-quality skills for large language model agents by decomposing execution traces into workflow structures, semantics, and attachments. The approach outperforms traditional summarization methods by 10.5%, demonstrating that treating traces as executable specifications rather than text yields more reliable agent behavior.
AI · Bullish · arXiv – CS AI · Jun 2 · 6/10
🧠Researchers introduce SkillRevise, a framework that automatically refines LLM agent skills through execution-grounded iteration, improving task success rates from 36% to 62% on benchmarks. The approach addresses the cold-start problem in agent development by diagnosing defects from execution traces and applying targeted repairs, while demonstrating strong cross-model transferability.
AI · Neutral · arXiv – CS AI · Jun 2 · 6/10
🧠Researchers introduce MMG2Skill, a framework that converts unstructured web guides into executable skills for AI agents, along with a new benchmark for evaluation. The system improves agent performance by 12.8 to 25.3 percentage points across multiple domains by structuring knowledge, conditioning vision-language models on refined skills, and iteratively improving those skills from agent trajectories.
AI · Neutral · arXiv – CS AI · May 27 · 5/10
🧠Researchers present a framework for managing uncertainty in language model-generated laboratory procedures for virtual educational environments. The system uses structured domain representations and LLM outputs to extract, validate, and repair procedural steps, addressing common LLM failures like missing actions, incorrect sequencing, and logical incompatibilities.
AI · Bullish · arXiv – CS AI · May 12 · 6/10
🧠EmbodiSkill is a training-free framework that lets embodied AI agents autonomously improve their skills by reflecting on task execution trajectories. By distinguishing between skill deficiencies and execution lapses, the system allows frozen language models to achieve significantly higher task success rates, with a Qwen 3.5-27B model reaching 93.28% success on ALFWorld.
🧠 GPT-5
AI · Neutral · arXiv – CS AI · Mar 16 · 6/10
🧠SkillsBench is a new benchmark for evaluating Agent Skills, structured packages of procedural knowledge that enhance LLM agents. Testing across 86 tasks and 11 domains shows that curated Skills improve performance by 16.2 percentage points on average, while self-generated Skills provide no benefit.