🧠 AI | ⚪ Neutral | Importance 7/10

CCTU: A Benchmark for Tool Use under Complex Constraints

arXiv – CS AI | Junjie Ye, Guoqiang Zhang, Wenjie Fu, Tao Gui, Qi Zhang, Xuanjing Huang | March 17, 2026 at 04:00 AM

🤖 AI Summary

Researchers introduce CCTU, a new benchmark for evaluating how well large language models use tools under complex constraints. The study finds that even state-of-the-art LLMs complete fewer than 20% of tasks when strict constraint adherence is required, and that models violate constraints in over 50% of cases.

Key Takeaways

→ The new CCTU benchmark tests LLM tool use under complex constraints spanning 12 categories and 4 dimensions.
→ No state-of-the-art LLM achieves a task completion rate above 20% when strict constraint adherence is required.
→ Models violate constraints in over 50% of cases, particularly in the resource and response dimensions.
→ LLMs show limited capacity for self-refinement, even after receiving detailed feedback on constraint violations.
→ The benchmark includes 200 test cases, with an average of 7 constraint types per case.
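
The two headline figures above measure different things, and a small sketch may help separate them. The Python below is illustrative only: the `CaseResult` record, its field names, and both scoring functions are assumptions made for this example and are not taken from the CCTU paper or its code. It assumes a case counts as strictly completed only if the task is solved and no constraint is violated, and that a case counts toward the violation rate if it breaks at least one constraint.

```python
from dataclasses import dataclass, field


@dataclass
class CaseResult:
    """Hypothetical per-case record; field names are illustrative, not CCTU's."""
    task_solved: bool  # did the model reach the correct final outcome?
    violated: list[str] = field(default_factory=list)  # constraints broken


def strict_completion_rate(results: list[CaseResult]) -> float:
    """Share of cases solved with zero constraint violations (assumed definition)."""
    if not results:
        return 0.0
    passed = sum(1 for r in results if r.task_solved and not r.violated)
    return passed / len(results)


def violation_rate(results: list[CaseResult]) -> float:
    """Share of cases with at least one constraint violation (assumed definition)."""
    if not results:
        return 0.0
    return sum(1 for r in results if r.violated) / len(results)


# Made-up toy data, not benchmark results.
results = [
    CaseResult(task_solved=True),
    CaseResult(task_solved=True, violated=["resource"]),
    CaseResult(task_solved=False, violated=["response", "resource"]),
    CaseResult(task_solved=False),
]
print(strict_completion_rate(results))  # 0.25
print(violation_rate(results))          # 0.5
```

Under these assumed definitions, a model can solve a task and still fail the strict metric, as the second toy case does, which is one way a low completion rate and a high violation rate can coexist.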