🧠 AI | ⚪ Neutral | Importance 6/10

Anchor: Mitigating Artifact Drift in Agent Benchmark Generation

arXiv – CS AI | Maksim Ivanov, Abhijay Rana | May 27, 2026 at 04:00 AM

🤖AI Summary

Researchers introduce Anchor, a task-generation pipeline that addresses 'artifact drift' in AI agent benchmarking by automatically creating consistent instructions, environments, solutions, and verifiers from formal specifications. The team releases ERP-Bench, a 300-task benchmark for enterprise workflows, finding frontier AI models solve only 17.4% of tasks optimally despite meeting explicit constraints 26.1% of the time.

Analysis

The paper tackles a critical infrastructure problem in AI development: the disconnect between task specifications, evaluation environments, and success criteria that plagues benchmark creation. Artifact drift, in which loosely coordinated processes produce inconsistent or unsolvable tasks, undermines the reliability of agent training and evaluation. By formalizing domain expert knowledge into constraint optimization programs, Anchor creates verifiable task pipelines in which every component (instruction, environment, solution, and verifier) derives from a single parametric source, so that the artifacts stay consistent with one another.
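The single-source idea can be illustrated with a toy sketch. Everything below is hypothetical and is not taken from Anchor itself: the `ProcurementSpec` class, the supplier-ordering task, and the brute-force solver are stand-ins chosen only to show how an instruction, an environment, a reference solution, and a verifier can all be derived from one parametric specification. The sketch assumes the spec is solvable (total capacity covers demand).

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class ProcurementSpec:
    """Hypothetical parametric spec: the only source for every task artifact."""
    demand: int          # total units that must be ordered
    unit_prices: tuple   # price per unit, one entry per supplier
    capacities: tuple    # maximum units each supplier can deliver


def render_instruction(spec):
    # Natural-language task text, generated from the spec rather than hand-written.
    return (f"Order exactly {spec.demand} units across "
            f"{len(spec.unit_prices)} suppliers at minimum total cost "
            f"without exceeding any supplier's capacity.")


def build_environment(spec):
    # Environment state the agent would interact with, built from the same spec.
    return {"suppliers": [
        {"id": i, "price": p, "capacity": c}
        for i, (p, c) in enumerate(zip(spec.unit_prices, spec.capacities))
    ]}


def cost(spec, order):
    return sum(q * p for q, p in zip(order, spec.unit_prices))


def is_feasible(spec, order):
    # Explicit constraints: per-supplier capacity and exact total demand.
    return (len(order) == len(spec.capacities)
            and all(0 <= q <= c for q, c in zip(order, spec.capacities))
            and sum(order) == spec.demand)


def solve(spec):
    # Brute-force reference solution; a real pipeline would use a proper solver.
    candidates = product(*(range(c + 1) for c in spec.capacities))
    feasible = [o for o in candidates if is_feasible(spec, o)]
    return min(feasible, key=lambda o: cost(spec, o))


def verify(spec, order):
    # Verifier reuses the same constraints and objective as the solver.
    feasible = is_feasible(spec, order)
    optimal = feasible and cost(spec, order) == cost(spec, solve(spec))
    return {"feasible": feasible, "optimal": optimal}
```

Because `solve` and `verify` share `is_feasible` and `cost`, the verifier cannot disagree with the reference solution about what counts as correct, which is the property the single-source design is meant to provide. The two flags returned by `verify` mirror the distinction between meeting explicit constraints and reaching an optimal result.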

This work emerges from the growing pressure to evaluate AI agents on realistic, economically valuable tasks. Traditional benchmarks often struggle to balance authenticity with measurability; Anchor addresses this by tying evaluation to business correctness rather than to checks that are vulnerable to reward hacking. The release of ERP-Bench demonstrates the approach's viability, providing 300 long-horizon tasks in procurement and manufacturing, two domains where enterprise customers need reliable agent performance.

The results reveal a substantial gap between constraint satisfaction and optimal performance. Frontier models meet the explicit constraints on 26.1% of tasks but reach an optimal solution on only 17.4%, which suggests that current agents satisfy stated constraints more readily than they reach a fully correct end state. This finding has direct implications for enterprises considering agent deployment: current models may require significant supervision or hybrid human-AI workflows for mission-critical operations.
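The size of that gap can be worked out directly from the two reported rates. The snippet below is only arithmetic on the 26.1% and 17.4% figures. The conditional rate assumes that every optimal solution also satisfies the explicit constraints, which the summary does not state, and the function name `summarize_gap` is illustrative.

```python
def summarize_gap(constraint_rate, optimal_rate):
    """Compare a constraint-satisfaction rate with an optimal-solution rate."""
    # Absolute gap in percentage points.
    gap_points = (constraint_rate - optimal_rate) * 100
    # Share of constraint-satisfying outcomes that are also optimal.
    # Assumption: optimal outcomes are a subset of constraint-satisfying ones.
    optimal_given_constraints = optimal_rate / constraint_rate
    return {
        "gap_points": gap_points,
        "optimal_given_constraints": optimal_given_constraints,
    }


summary = summarize_gap(constraint_rate=0.261, optimal_rate=0.174)
print(f"gap: {summary['gap_points']:.1f} percentage points")
print(f"optimal among constraint-satisfying: {summary['optimal_given_constraints']:.1%}")
```

Under that assumption, the gap is 8.7 percentage points, and roughly two thirds of the outcomes that meet the constraints are also optimal.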

The broader impact extends beyond benchmarking. Anchor's methodology could standardize how enterprises evaluate custom agents, reducing deployment risk and enabling transparent vendor comparison. As AI agents move toward autonomous business operations, auditable evaluation frameworks become essential infrastructure. The open release signals confidence in the approach and invites community expansion into other enterprise domains.

Key Takeaways

→Anchor eliminates artifact drift in agent benchmarking by deriving task instructions, environments, solutions, and verifiers from unified parametric specifications.
→ERP-Bench reveals that frontier models satisfy explicit task constraints in 26.1% of cases but achieve fully optimal solutions only 17.4% of the time.
→The framework enables controlled task difficulty generation with formally verified ground-truth solutions, improving evaluation reliability.
→Enterprise adoption of autonomous agents requires auditable evaluation environments; Anchor provides a replicable methodology for building them.
→Open-source release of Anchor and ERP-Bench establishes a potential standard for evaluating AI agents on economically valuable business workflows.