SKILL.nb: Selective Formalization and Gated Execution for Durable Agent Workflows
SKILL.nb is a new framework that improves AI agent reliability by selectively formalizing workflow steps based on execution evidence, storing them as versioned notebooks that combine natural language guidance with executable code. The system achieved 53.7% single-round success on the WebArena-Verified web automation benchmark and retained 91.7% performance across three re-executions, significantly outperforming existing baselines in handling environment drift and task specification changes.
SKILL.nb addresses a fundamental challenge in deploying reusable AI agent workflows: reliability degradation over time. As AI agents increasingly generate reusable artifacts like code and procedural memories, these components often fail when deployed in new environments or against slightly different task distributions. This lifecycle problem becomes acute in web automation, where even minor environmental changes can break previously successful workflows. The framework's innovation lies in selective formalization—dynamically deciding which workflow steps should be hardcoded versus guided by natural language based on real execution evidence.
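A minimal sketch of what an evidence-based formalization decision could look like is shown here. It is an illustration only: the `StepEvidence` structure, the `choose_mode` function, and both thresholds are hypothetical and are not taken from SKILL.nb, whose actual decision criteria are not described in this summary.

```python
from dataclasses import dataclass


@dataclass
class StepEvidence:
    """Execution history for one workflow step (hypothetical structure)."""
    name: str
    code_successes: int  # runs where a hardcoded version of the step passed
    code_attempts: int   # total runs of the hardcoded version


def choose_mode(ev: StepEvidence, min_attempts: int = 3, min_rate: float = 0.9) -> str:
    """Return "code" if the evidence supports hardcoding the step, else "guidance".

    Both thresholds are illustrative assumptions, not values from SKILL.nb.
    """
    if ev.code_attempts < min_attempts:
        return "guidance"  # too little evidence to justify hardcoding
    rate = ev.code_successes / ev.code_attempts
    return "code" if rate >= min_rate else "guidance"


steps = [
    StepEvidence("open_login_page", code_successes=10, code_attempts=10),
    StepEvidence("pick_search_result", code_successes=4, code_attempts=10),
    StepEvidence("export_report", code_successes=1, code_attempts=1),
]
plan = {s.name: choose_mode(s) for s in steps}
print(plan)
```

Under these assumed thresholds, only the step with a long, clean track record is formalized as code; the brittle step and the step with a single observation stay under natural language guidance.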
The technical approach uses versioned, auditable notebooks that interleave multiple execution modes: executable code cells, natural language guidance, validation gates, and fallback paths with multimodal evidence tracking. This hybrid approach acknowledges that not all workflow steps benefit equally from formalization; some remain more robust when flexibly guided. Gate-conditioned execution enables graceful degradation, allowing steps to fall back to alternative approaches when validation gates detect environmental drift.
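Gate-conditioned execution with a fallback path can be sketched roughly as follows. This is a simplified illustration under stated assumptions, not the SKILL.nb implementation: the `Cell` type, the `execute` loop, and the toy page state are all hypothetical, and the fallback function merely stands in for a natural-language-guided agent step.

```python
from dataclasses import dataclass
from typing import Callable, Optional

State = dict


@dataclass
class Cell:
    """One notebook step (hypothetical): a primary path, a gate, an optional fallback."""
    name: str
    run: Callable[[State], State]      # primary path, e.g. executable code
    gate: Callable[[State], bool]      # validation gate checked on the resulting state
    fallback: Optional[Callable[[State], State]] = None


def execute(cells: list[Cell], state: State) -> tuple[State, list[tuple[str, str]]]:
    """Run cells in order; take the fallback when the gate rejects the primary result."""
    log = []
    for cell in cells:
        candidate = cell.run(dict(state))  # work on a copy so a failed step leaves state intact
        if cell.gate(candidate):
            state = candidate
            log.append((cell.name, "primary"))
            continue
        if cell.fallback is not None:
            candidate = cell.fallback(dict(state))
            if cell.gate(candidate):
                state = candidate
                log.append((cell.name, "fallback"))
                continue
        log.append((cell.name, "failed"))
        break  # stop rather than run later steps on an unvalidated state
    return state, log


def click_old_button(state: State) -> State:
    # Hardcoded selector that only exists in the old page layout.
    if "#submit-v1" in state["selectors"]:
        state["submitted"] = True
    return state


def guided_submit(state: State) -> State:
    # Stand-in for a natural-language-guided step that adapts to the new layout.
    state["submitted"] = True
    return state


cells = [Cell("submit_form", click_old_button,
              lambda s: s.get("submitted", False), guided_submit)]
final_state, log = execute(cells, {"selectors": ["#submit-v2"]})
print(log)  # [('submit_form', 'fallback')]
```

In this toy run the page has drifted from `#submit-v1` to `#submit-v2`, so the hardcoded path fails its gate and the step degrades to the guided fallback instead of failing the whole workflow.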
The empirical results demonstrate meaningful advances in reliability. On the WebArena-Verified benchmark, SKILL.nb achieves 53.7% single-round success while retaining 91.7% performance across three re-executions, substantially better than baselines that degrade significantly with repeated execution. The system recovers 72.9% of subsequent failures under bounded repair while limiting regressions to 4.2%, indicating effective failure recovery without cascading damage. Cross-domain testing on Mind2Web and on real-world GitLab migration scenarios confirms that the approach generalizes beyond a single benchmark.
These results suggest that lifecycle governance and mixed-mode execution strategies are underexplored axes of reliability in AI systems, with implications for production deployment of autonomous agents across web automation, enterprise workflow, and infrastructure management domains.
- SKILL.nb uses evidence-based selective formalization to dynamically decide which workflow steps should be executable code and which should be guided by natural language
- The system achieved 53.7% success on WebArena-Verified and retained 91.7% performance across repeated executions, outperforming baselines by 15.5 percentage points
- Gate-conditioned execution enables graceful degradation when environment drift is detected, with fallback paths preventing catastrophic failures
- The versioned notebook design makes workflows auditable and revisable, supporting continuous improvement through accumulated execution evidence
- The framework demonstrates applicability across domains including web automation, cross-website tasks, and real infrastructure migrations such as GitLab version upgrades