When Safe Skills Collide: Measuring Compositional Risk in Agent Skill Ecosystems
Researchers present SkillReact, a framework for measuring compositional safety risks in LLM agent skill ecosystems. They find that 18.2% of individually-safe skill pairs create genuine safety vulnerabilities when combined, risks that per-skill scanning alone misses. Testing on 211,575 skill pairs from ClawHub reveals model-dependent execution risk, with smaller models like Haiku more likely to execute unsafe tool chains than larger models like Sonnet.
The research addresses a critical blind spot in AI safety: individual component safety doesn't guarantee system safety when components interact. While security auditing has long focused on isolated modules, LLM agents operate as compositional systems where skill combinations create emergent behaviors. The study finds roughly 14,000 genuine risk memberships in a single registry despite per-skill scanning, which points to a substantial undetected vulnerability class.
This work builds on growing concerns about agent autonomy and tool-use safety as AI systems gain broader capabilities. The field has primarily emphasized individual guardrails, but compositional vulnerabilities reveal structural limitations in that approach. The SkillReact framework's three-component methodology (static analysis, human-adjudicated validation, and dynamic harness testing) provides a replicable measurement approach that other registries could adopt.
The findings carry implications for AI deployment practices. The variation across host models (Haiku executing full chains, Opus stopping partway, Sonnet refusing) shows that system safety depends on host-model design choices, not just on the installed components. This creates a coordination problem: skill developers, registry maintainers, and model providers each control different safety levers, and their incentives are not necessarily aligned.
Developers and organizations deploying agent systems should anticipate similar compositional risk profiles in other existing skill ecosystems. The research suggests that install-time composition checks and capability isolation should be treated as critical infrastructure, not optional hardening. As agent systems proliferate in production environments, compositional risk assessment will likely become a regulatory and operational requirement alongside traditional security auditing.
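To make the idea of an install-time composition check concrete, here is a minimal sketch. It is not SkillReact's method: the capability labels, the `RISKY_CHAINS` list, and the `risky_pairs` helper are all hypothetical, and it assumes each skill declares a set of capabilities. The check flags a pair only when neither skill covers a risky chain alone but the two together do, which is exactly the case per-skill scanning cannot see.

```python
from itertools import combinations

# Hypothetical capability chains treated as unsafe when jointly available.
# A real registry would need its own taxonomy; these labels are assumptions.
RISKY_CHAINS = [
    frozenset({"read_secrets", "network_send"}),   # possible data exfiltration
    frozenset({"fetch_untrusted", "shell_exec"}),  # possible remote code execution
]


def risky_pairs(skills):
    """Return (skill_a, skill_b, chain) for pairs that complete a risky chain
    which neither skill covers on its own."""
    findings = []
    for (a, caps_a), (b, caps_b) in combinations(sorted(skills.items()), 2):
        for chain in RISKY_CHAINS:
            covered_alone = chain <= caps_a or chain <= caps_b
            if not covered_alone and chain <= (caps_a | caps_b):
                findings.append((a, b, tuple(sorted(chain))))
    return findings


# Hypothetical installed skills, each passing a per-skill scan on its own.
installed = {
    "notes_reader": {"read_secrets"},
    "webhook_poster": {"network_send"},
    "calculator": set(),
}

for a, b, chain in risky_pairs(installed):
    print(f"{a} + {b} completes chain: {' -> '.join(chain)}")
```

A check like this only establishes that a chain is reachable. Whether an agent actually executes it depends on the host model, which is why the study pairs static analysis with dynamic harness testing.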
- →18.2% of individually-safe skill pairs create real compositional safety risks, totaling ~14K undiscovered vulnerabilities in one registry
- →The host model's size and design gate whether unsafe skill combinations execute: the smaller Haiku ran full chains, Opus stopped partway, and Sonnet refused
- →Per-skill scanning misses compositional vulnerabilities by construction, requiring new install-time validation frameworks
- →Capability composition across installed skills determines which unsafe chains are reachable, while the host model's disposition determines whether they are actually executed
- →Compositional safety requires coordination across skill developers, registries, and model providers with currently misaligned incentives