SkillResolve-Bench: Measuring and Resolving Same-Capability Ambiguity in Agent Skill Retrieval
Researchers introduce SkillResolve-Bench, a benchmark for evaluating agent skill retrieval systems that addresses the critical problem of selecting the correct skill variant when multiple capabilities are semantically similar. The benchmark includes 661 helper/risky skill pairs and proposes SkillResolve, a method that achieves safer procedural exposure by selecting appropriate skill representatives from capability families.
Agent systems increasingly rely on retrieving skills from large libraries to execute complex tasks, but current retrieval methods face a nuanced challenge: finding the right capability family is insufficient if the retrieved skill variant contains stale resources, missing preconditions, or incorrect procedures. This same-capability ambiguity problem represents a real execution risk that traditional relevance-based retrieval cannot adequately address. The SkillResolve-Bench benchmark tackles this gap by constructing paired scenarios where a helpful skill coexists with a risky sibling sharing the same capability family, forcing systems to discriminate beyond surface-level matching.
The research builds on growing recognition that AI agents need more sophisticated information retrieval mechanisms than simple semantic similarity. As agent architectures mature and skill libraries expand, a library holds more near-duplicate variants of each capability, so the cost of selecting the wrong procedural variant grows with it. The benchmark's design, which includes source-role evidence, cue leakage checks, and query-disjoint splits, reflects careful attention to preventing gaming and ensuring realistic evaluation conditions.
The SkillResolve method demonstrates that within-family representative selection is crucial for safety. It achieves zero harmful sibling exposure at top-3 (HSR@3=0) while maintaining strong recall metrics, which shows that finding the right capability family is not enough: retrieval quality also depends on disambiguating among the variants within that family. SkillResolve also improves on SkillRouter by 0.112 in Recall and 0.165 in NDCG. This work matters for developers building production agent systems where execution failures carry material consequences. Organizations deploying agents with skill libraries should prioritize retrieval methods that account for same-capability risks rather than relying on broad relevance matching.
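To make the HSR@3 figure concrete, here is a minimal Python sketch of how such a metric could be computed. It assumes HSR@k is the fraction of queries whose top-k results contain at least one risky sibling; the summary does not spell out the exact definition, so treat that as an assumption. The function name, the data layout, and the skill identifiers are all hypothetical.

```python
from typing import Dict, List, Set


def harmful_sibling_rate_at_k(
    rankings: Dict[str, List[str]],
    risky: Dict[str, Set[str]],
    k: int = 3,
) -> float:
    """Fraction of queries whose top-k list contains a risky sibling.

    Assumed definition: a query counts as "exposed" if any of its
    first k retrieved skills is labeled risky for that query.
    """
    if not rankings:
        return 0.0
    exposed = sum(
        1
        for query, ranked in rankings.items()
        if any(skill in risky.get(query, set()) for skill in ranked[:k])
    )
    return exposed / len(rankings)


# Hypothetical toy data: ranked skill IDs per query, plus risky siblings.
rankings = {
    "q1": ["deploy_v2", "deploy_v1_stale", "rollback"],
    "q2": ["backup_db", "restore_db", "rotate_keys"],
}
risky = {"q1": {"deploy_v1_stale"}, "q2": {"backup_db_legacy"}}

hsr_at_3 = harmful_sibling_rate_at_k(rankings, risky, k=3)
hsr_at_1 = harmful_sibling_rate_at_k(rankings, risky, k=1)
```

In this toy data only `q1` surfaces a risky sibling within its top 3, and none does at top 1. Under this reading, HSR@3=0 means no query's top-3 list contains a risky sibling.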
- Same-capability ambiguity in skill retrieval poses execution risks when multiple skills share capability families but differ in preconditions or procedures.
- SkillResolve-Bench provides 661 auditable helper/risky skill pairs for benchmarking this safety-critical retrieval problem.
- SkillResolve method achieves zero harmful sibling exposure at top-3 results through family-aware representative selection.
- Within-family skill selection emerges as the critical mechanism separating safe from unsafe procedural exposure.
- Current retrieval systems like SkillRouter expose risky siblings in 69.3% of top-3 results, highlighting widespread vulnerability.
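The idea of family-aware representative selection mentioned above can be illustrated with a short Python sketch. This is not the SkillResolve algorithm, which the summary does not describe in detail. It is a simplified stand-in that groups candidates by capability family and keeps one low-risk representative per family. The `Candidate` fields, the `risk` score, and the `risk_threshold` parameter are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Candidate:
    skill_id: str
    family: str       # capability family the skill belongs to
    relevance: float  # retriever score, higher is better
    risk: float       # hypothetical risk estimate in [0, 1]


def select_representatives(
    candidates: List[Candidate],
    k: int = 3,
    risk_threshold: float = 0.5,
) -> List[Candidate]:
    """Keep at most one representative per family, skipping risky variants."""
    best: Dict[str, Candidate] = {}
    for cand in candidates:
        if cand.risk >= risk_threshold:
            continue  # drop variants flagged as risky (assumed signal)
        current = best.get(cand.family)
        if current is None or cand.relevance > current.relevance:
            best[cand.family] = cand
    # Rank the surviving representatives by relevance and truncate to k.
    ranked = sorted(best.values(), key=lambda c: c.relevance, reverse=True)
    return ranked[:k]


candidates = [
    Candidate("deploy_v1_stale", "deploy", relevance=0.92, risk=0.9),
    Candidate("deploy_v2", "deploy", relevance=0.88, risk=0.1),
    Candidate("rollback", "rollback", relevance=0.70, risk=0.2),
]
top = select_representatives(candidates, k=3)
selected_ids = [c.skill_id for c in top]
```

Here the stale deploy variant has the highest relevance score, so ranking by relevance alone would put it first. Filtering within the family returns `deploy_v2` as the representative instead.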