EnterpriseBench Corecraft: Training Generalizable Agents on High-Fidelity RL Environments
arXiv CS AI | Sushant Mehta, Logan Ritchie, Suhaas Garre, Ian Niebres, Nick Heiner, Edwin Chen
AI Summary
Surge AI introduces CoreCraft, the first environment in EnterpriseBench for training AI agents on realistic enterprise workflows. Training GLM 4.6 on this high-fidelity customer support simulation improved task performance from 25% to 37% and showed positive transfer to other benchmarks, demonstrating that quality training environments enable generalizable AI capabilities.
Key Takeaways
- CoreCraft simulates a complete customer support organization with 2,500+ entities and 23 tools to test AI agent capabilities on real-world tasks.
- Leading frontier models like GPT-5.2 and Claude Opus 4.6 solve fewer than 30% of expert-evaluated tasks.
- Training GLM 4.6 with GRPO improved performance from 25.37% to 36.76% on held-out evaluation tasks.
- The training gains transferred to other benchmarks with improvements of 4.5-7.4% across different domains.
- High-fidelity, diverse training environments with expert evaluation are key to developing generalizable AI agents.
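
To put the held-out result in perspective, the short sketch below works out the absolute and relative gain implied by the reported 25.37% and 36.76% pass rates. The function name `gains` and the variable names are illustrative only and do not come from the paper; the two input figures are the only values taken from the summary.

```python
def gains(before: float, after: float) -> tuple[float, float]:
    """Return (absolute gain in percentage points, relative gain in percent)."""
    absolute = after - before              # difference between the two pass rates
    relative = absolute / before * 100.0   # gain expressed relative to the baseline
    return absolute, relative


# Pass rates reported in the summary for GLM 4.6 before and after GRPO training.
baseline, trained = 25.37, 36.76
abs_pts, rel_pct = gains(baseline, trained)

print(f"absolute gain: {abs_pts:.2f} percentage points")  # 11.39
print(f"relative gain: {rel_pct:.1f}%")                   # 44.9%
```

In other words, the 11.39-point improvement is roughly a 45% relative increase over the untrained baseline.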