Trace2Policy: From Expert Behavior Traces to Self-Evolving Decision Agents
Trace2Policy introduces EISR (Error-driven Iterative Skill Refinement), a systematic method to extract and refine implicit decision rules from expert behavior through iterative error analysis. Deployed at a major logistics carrier for 22 days, the approach achieved 79.6% accuracy with deterministic Python execution, outperforming LLM-based baselines by 9.8 percentage points and eliminating inference-time LLM dependency.
Trace2Policy addresses a critical gap in enterprise AI: converting tacit expert knowledge into transparent, executable rules rather than relying on black-box language models. The research demonstrates that for high-stakes, skewed-distribution tasks like compliance and auditing, rule quality—not model capability—drives performance. This insight reframes the AI engineering problem from pursuing ever-larger models to systematizing how humans actually make decisions.
The Error-driven Iterative Skill Refinement mechanism operates as a feedback loop: rules execute against validation data, errors are clustered into semantic categories (missing logic, incorrect conditions, conflicting rules), targeted patches are applied, and regression gates block any patch that degrades existing behavior. Because the optimization target is a set of human-readable rules rather than the weights of an end-to-end deep learning model, the result is auditable and explainable, which is essential for regulated industries. The 22-day production deployment at a logistics carrier validates the approach beyond academic benchmarks.
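The loop described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the rule representation (plain functions returning a decision or `None`), the first-match-wins policy, the `DEFAULT` decision, the error-bucketing heuristic, and all toy rules and cases are assumptions made here for illustration only.

```python
from collections import Counter
from typing import Callable, Optional

Case = dict
# A rule returns a decision string, or None when it does not apply (assumed format).
Rule = Callable[[Case], Optional[str]]

DEFAULT = "approve"  # hypothetical fallback decision when no rule fires


def decide(rules: list[Rule], case: Case) -> str:
    """First matching rule wins; fall back to DEFAULT when none fires."""
    for rule in rules:
        verdict = rule(case)
        if verdict is not None:
            return verdict
    return DEFAULT


def accuracy(rules: list[Rule], cases: list[Case]) -> float:
    return sum(decide(rules, c) == c["label"] for c in cases) / len(cases)


def categorize_errors(rules: list[Rule], cases: list[Case]) -> Counter:
    """Bucket each misclassified case into one of three coarse categories."""
    buckets: Counter = Counter()
    for case in cases:
        if decide(rules, case) == case["label"]:
            continue
        fired = {v for v in (r(case) for r in rules) if v is not None}
        if not fired:
            buckets["missing_logic"] += 1        # no rule covered the case
        elif len(fired) > 1:
            buckets["conflicting_rules"] += 1    # rules disagreed
        else:
            buckets["incorrect_condition"] += 1  # a rule fired but was wrong
    return buckets


def refine(rules, patches, validation, regression):
    """Accept a patch only if validation accuracy rises and the regression gate holds."""
    for patch in patches:
        trial = [patch] + rules  # the patch takes priority over existing rules
        improves = accuracy(trial, validation) > accuracy(rules, validation)
        safe = accuracy(trial, regression) >= accuracy(rules, regression)
        if improves and safe:
            rules = trial
    return rules


# Toy rules and data, invented for this example.
def flag_large(case):       # initial rule
    return "flag" if case["amount"] > 1000 else None

def flag_small(case):       # overfitted patch: fixes validation, breaks regression
    return "flag" if case["amount"] < 250 else None

def flag_no_receipt(case):  # patch that addresses the actual missing logic
    return "flag" if not case["receipt"] else None

validation = [
    {"amount": 1500, "receipt": True, "label": "flag"},
    {"amount": 200, "receipt": False, "label": "flag"},
    {"amount": 300, "receipt": True, "label": "approve"},
]
regression = [
    {"amount": 50, "receipt": True, "label": "approve"},
    {"amount": 2000, "receipt": True, "label": "flag"},
]

rules = [flag_large]
errors_before = categorize_errors(rules, validation)
refined = refine(rules, [flag_small, flag_no_receipt], validation, regression)
```

In this toy run the only initial error falls into the `missing_logic` bucket. The `flag_small` patch fixes it on the validation set but is rejected by the regression gate, while `flag_no_receipt` passes both checks and is kept.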
A striking finding emerges: the same EISR-refined rules executed as compiled Python outperform identical rules prompted to an LLM by 9.8 percentage points (79.6% vs. roughly 70%), while eliminating runtime LLM costs and latency. This challenges the assumption that LLM prompting is the optimal execution path. The Auto-EISR variant makes the methodology more accessible by automating refinement at a cost of $5-10 per cycle, compared with 70 expert-hours for manual refinement, and it transfers to legal reasoning and process-mining benchmarks without reengineering.
For enterprise adoption, this work signals a shift toward hybrid systems: LLMs for initial rule synthesis and exploration, deterministic execution for production reliability. The approach also highlights that in skewed-base-rate domains, accuracy metrics alone obscure the performance gap between transparent rules and opaque models.
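A small worked example shows why accuracy alone can mislead when base rates are skewed. The numbers below are invented for illustration and are not results from the study; the 90/10 class split and both prediction sets are assumptions.

```python
def accuracy(y_true, y_pred):
    """Fraction of cases where the prediction matches the label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def recall(y_true, y_pred, positive):
    """Fraction of truly positive cases that were predicted positive."""
    relevant = [p for t, p in zip(y_true, y_pred) if t == positive]
    return sum(p == positive for p in relevant) / len(relevant)


# Hypothetical audit set: 90 routine cases, 10 that should be flagged.
y_true = ["approve"] * 90 + ["flag"] * 10

# Baseline that never flags anything.
always_approve = ["approve"] * 100

# Hypothetical rule set: wrongly flags 5 routine cases, catches 8 of 10 real flags.
rule_based = ["approve"] * 85 + ["flag"] * 5 + ["flag"] * 8 + ["approve"] * 2

acc_baseline = accuracy(y_true, always_approve)
acc_rules = accuracy(y_true, rule_based)
recall_baseline = recall(y_true, always_approve, "flag")
recall_rules = recall(y_true, rule_based, "flag")

print(f"accuracy: baseline={acc_baseline:.2f} rules={acc_rules:.2f}")
print(f"flag recall: baseline={recall_baseline:.2f} rules={recall_rules:.2f}")
```

Here the two systems differ by only three points of accuracy (0.90 vs. 0.93), yet the baseline catches none of the cases that matter while the rule set catches 80% of them. Reporting per-class recall alongside accuracy makes that gap visible.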
- Rule quality, not model size, dominates performance on compliance-sensitive, skewed-distribution decision tasks.
- EISR-refined rules compiled to deterministic Python achieved 79.6% accuracy, outperforming LLM execution by 9.8 percentage points at zero inference cost.
- Production deployment across 3,349 audit cases confirmed that re-enabling LLM fallback monotonically degraded accuracy, contradicting conventional AI strategies.
- Auto-EISR automates expert-knowledge extraction at $5-10 per refinement cycle versus 70 expert-hours, enabling scalable policy discovery.
- The approach transfers to legal reasoning and process-mining benchmarks without task-specific reengineering, suggesting broad applicability across knowledge-work domains.