Skill Availability and Presentation Granularity in Large-Language-Model Agents: A Controlled SkillsBench Study
A controlled study examines how large-language-model agents perform with different skill documentation formats using SkillsBench, finding that skill availability dramatically improves task success (18-36 percentage points) while variations in presentation granularity produce minimal and uncertain effects across models.
This research addresses a fundamental question in AI systems design: does the way we present information to language models matter as much as whether we present it at all? The study collects 1,800 data points across two models (GPT-5.5 and DeepSeek V4-Flash), comparing six skill presentation conditions over 30 balanced tasks with multiple trials per task. The dominant finding is clear: providing skill documents to agents substantially improves their ability to complete tasks, with gains of 18 to 36 percentage points depending on the model. This supports the core premise that injecting procedural knowledge at inference time delivers meaningful value.
The secondary findings are more nuanced. When researchers compared low-abstraction guidance (detailed, granular instructions) with high-abstraction guidance (conceptual summaries), they found small and statistically uncertain differences: 0.7 and -6.7 percentage points for GPT-5.5 and DeepSeek V4-Flash respectively, with confidence intervals spanning zero. Adding worked examples to medium-abstraction guidance yielded similarly modest improvements of 0.7 to 1.3 percentage points. Within the range tested, presentation granularity therefore has little measurable effect on task success.
For developers building agent systems, the practical implication is that heavy investment in optimizing documentation format is likely to yield smaller returns than simply ensuring that skills are available. The direction of the abstraction effect differs by model (low abstraction nominally ahead for GPT-5.5, high abstraction nominally ahead for DeepSeek V4-Flash), which suggests that any tuning of presentation may need to be model-specific, although the uncertainty is too large to draw firm conclusions. Because the study covers a controlled subset of SkillsBench, it provides a baseline and leaves open how these patterns scale to larger, more complex task domains.
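The reported sample size can be sanity-checked with simple arithmetic. The sketch below assumes a fully crossed design with the same number of trials in every model, condition, and task cell; the summary does not state this, and the resulting trial count is an inference, not a reported figure.

```python
# Values stated in the summary.
N_MODELS = 2        # GPT-5.5 and DeepSeek V4-Flash
N_CONDITIONS = 6    # skill presentation conditions
N_TASKS = 30        # balanced tasks
TOTAL_RUNS = 1800   # reported data points

# Assumption: every (model, condition, task) cell has the same number of trials.
cells = N_MODELS * N_CONDITIONS * N_TASKS
trials_per_cell, remainder = divmod(TOTAL_RUNS, cells)

print(f"cells: {cells}")
print(f"implied trials per cell: {trials_per_cell} (remainder {remainder})")
```

A remainder of zero means the stated totals are at least consistent with an equal-trials design; it does not confirm that the study was run that way.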
- Providing skill documents improves LLM agent task success by 18-36 percentage points compared to no skills.
- Presentation granularity differences between low and high abstraction levels show minimal, uncertain, and model-dependent effects.
- Adding worked examples to skill guidance produces negligible improvements of under 2 percentage points.
- Skill availability matters far more than how skills are formatted or explained to the model.
- Results vary between model architectures, suggesting architecture-specific optimization may outweigh presentation choices.
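To make "confidence intervals spanning zero" concrete, here is a minimal sketch of how a percentage-point difference between two conditions and its uncertainty could be estimated. It uses a paired percentile bootstrap over tasks, which is an assumption for illustration; the summary does not say which interval method the study used. The function names and the synthetic outcomes are hypothetical and are not the study's data.

```python
import random


def success_rate(outcomes):
    """Fraction of successful trials in a list of 0/1 outcomes."""
    return sum(outcomes) / len(outcomes)


def paired_bootstrap_diff(cond_a, cond_b, n_boot=2000, alpha=0.05, seed=0):
    """Difference in success rate (A minus B) in percentage points,
    with a percentile bootstrap interval obtained by resampling tasks.

    cond_a, cond_b: dict mapping task id -> list of 0/1 trial outcomes.
    Both dicts must cover the same tasks.
    """
    rng = random.Random(seed)
    tasks = sorted(cond_a)
    # Per-task difference keeps the pairing between conditions.
    per_task = [success_rate(cond_a[t]) - success_rate(cond_b[t]) for t in tasks]
    point = 100 * sum(per_task) / len(per_task)

    boots = []
    for _ in range(n_boot):
        sample = [rng.choice(per_task) for _ in per_task]
        boots.append(100 * sum(sample) / len(sample))
    boots.sort()
    lo = boots[int((alpha / 2) * n_boot)]
    hi = boots[int((1 - alpha / 2) * n_boot) - 1]
    return point, lo, hi


# Synthetic demonstration only: 30 tasks, 5 trials each, similar true rates.
demo_rng = random.Random(1)
low_abs = {t: [int(demo_rng.random() < 0.60) for _ in range(5)] for t in range(30)}
high_abs = {t: [int(demo_rng.random() < 0.58) for _ in range(5)] for t in range(30)}

point, lo, hi = paired_bootstrap_diff(low_abs, high_abs)
print(f"difference: {point:.1f} pp, 95% interval [{lo:.1f}, {hi:.1f}]")
print("interval spans zero" if lo <= 0 <= hi else "interval excludes zero")
```

Resampling whole tasks instead of individual trials treats the task as the unit of analysis, which tends to give wider and more honest intervals when trials within a task are correlated.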