
SkillsBench: Benchmarking How Well Agent Skills Work Across Diverse Tasks

arXiv – CS AI | Xiangyi Li, Wenbo Chen, Yimin Liu, Shenghan Zheng, Xiaokun Chen, Yifeng He, Yubo Li, Bingran You, Haotian Shen, Jiankai Sun, Shuyi Wang, Binxu Li, Qunhong Zeng, Di Wang, Xuandong Zhao, Yuanli Wang, Roey Ben Chaim, Zonglin Di, Yipeng Gao, Junwei He, Yizhuo He, Liqiang Jing, Luyang Kong, Xin Lan, Jiachen Li, Songlin Li, Yijiang Li, Yueqian Lin, Xinyi Liu, Xuanqing Liu, Haoran Lyu, Ze Ma, Bowei Wang, Runhui Wang, Tianyu Wang, Wengao Ye, Yue Zhang, Hanwen Xing, Yiqi Xue, Steven Dillmann, Han-chung Lee | March 16, 2026 at 04:00 AM

🤖AI Summary

SkillsBench is a benchmark for evaluating Agent Skills: structured packages of procedural knowledge that augment LLM agents. Across 86 tasks in 11 domains, curated Skills improve performance by 16.2 percentage points on average, while self-generated Skills provide no benefit.

Key Takeaways

→Curated Agent Skills improve LLM agent performance by 16.2 percentage points (pp) on average across diverse tasks.
→Gains vary widely by domain, from +4.5pp in Software Engineering to +51.9pp in Healthcare.
→Self-generated Skills provide no performance benefit, suggesting that models cannot reliably create the procedural knowledge they benefit from using.
→Focused Skills with 2-3 modules outperform comprehensive documentation.
→Smaller models equipped with Skills can match the performance of larger models without Skills.
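
The gains above are reported in percentage points (pp), which are absolute differences between two percentages, not relative improvements. The sketch below shows the distinction under the assumption that scores are expressed as percentages. The function names, domain labels, and all numbers are hypothetical and are not results from the paper.

```python
from statistics import mean


def pp_delta(score_without: float, score_with: float) -> float:
    """Absolute gain in percentage points between two scores given in percent."""
    return score_with - score_without


def relative_gain(score_without: float, score_with: float) -> float:
    """Relative improvement over the baseline, in percent of the baseline."""
    return 100.0 * (score_with - score_without) / score_without


# Hypothetical (score_without_skills, score_with_skills) pairs in percent.
# These values are made up for illustration; they are NOT from SkillsBench.
hypothetical_scores = {
    "domain_a": (30.0, 50.0),
    "domain_b": (60.0, 65.0),
}

# Per-domain gain in percentage points.
deltas = {
    domain: pp_delta(without, with_skills)
    for domain, (without, with_skills) in hypothetical_scores.items()
}

# Unweighted average of the per-domain gains.
mean_delta = mean(deltas.values())

for domain, delta in deltas.items():
    print(f"{domain}: {delta:+.1f}pp")
print(f"average: {mean_delta:+.1f}pp")
```

In this made-up example, a 20pp gain from a 30% baseline is roughly a 67% relative improvement, so the two measures can differ a great deal. Whether the paper averages gains per domain, per task, or per model configuration is not stated in this summary, so the unweighted mean here is only one possible choice.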