y0news

Can Agents Price a Reaction? Evaluating LLMs on Chemical Cost Reasoning

arXiv – CS AI | Yuyang Wu, Yue Huang, Shuaike Shen, Xujian Wang, Shuhao Zhang, Qiyao Xue, Weichen Liu, Runtian Gao, Jian Ma, Xiangliang Zhang, Olexandr Isayev
🤖 AI Summary

Researchers introduce ChemCost, a benchmark for evaluating LLM agents on chemical cost estimation from reaction descriptions. The study reveals that even frontier LLMs achieve only 50.6% accuracy on clean inputs and degrade significantly with realistic noise, exposing brittleness in parsing, evidence integration, and tool use despite access to domain-specific tools.

Analysis

The paper addresses a critical gap in AI evaluation: while LLMs demonstrate increasing capability in tool use, scientific domains lack rigorous benchmarks with ground-truth scoring independent of subjective judgment. Chemical cost reasoning requires agents to perform precise procedural reasoning—grounding chemical identities, retrieving supplier data, selecting valid procurement options, normalizing quantities, and computing accurate costs. This mirrors real-world domain applications where approximate or heuristic answers carry material consequences.
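The procurement stages described above (select a valid pack, normalize quantities, compute cost) can be sketched roughly as follows. The data structures and the single-supplier, whole-pack simplification here are illustrative assumptions, not the ChemCost schema or the paper's implementation:

```python
import math
from dataclasses import dataclass

# Hypothetical supplier quote: pack size in grams, price in USD.
@dataclass
class Quote:
    chemical: str
    pack_grams: float
    price_usd: float

def cheapest_cost(required_grams: float, quotes: list[Quote]) -> float:
    """Cheapest way to cover the required amount, simplified to buying
    whole packs of a single quote (real pack selection is harder)."""
    best = math.inf
    for q in quotes:
        packs = math.ceil(required_grams / q.pack_grams)  # can't buy partial packs
        best = min(best, packs * q.price_usd)
    return best

quotes = [Quote("NaCl", 100.0, 12.0), Quote("NaCl", 500.0, 40.0)]
print(cheapest_cost(250.0, quotes))  # 3 × 100 g packs ($36) beat 1 × 500 g pack ($40) → 36.0
```

Even this toy version shows why pack-selection logic is a distinct failure mode: the per-gram-cheapest quote (the 500 g pack) is not the cheapest valid purchase for this quantity.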

The ChemCost benchmark's construction with 1,427 reactions, 2,261 chemicals, and 230,775 supplier quotes frozen to a specific pricing snapshot enables repeatable, objective evaluation without LLM-as-judge scoring. This methodological rigor matters because it moves beyond curated demonstrations and expert assessment toward reproducible science.

The experimental results expose a fundamental limitation: tool access alone does not guarantee competent tool use. That agents reach only 50.6% accuracy (within 25% relative error) on clean data, and then degrade further under noise injection, indicates fragility in multi-step reasoning chains. Stage-level diagnosis identifies specific failure modes—brittle parsing of chemical names and quantities, poor integration of retrieved evidence, invalid pack selection logic, and non-convergent tool invocation patterns—rather than attributing failures to vague "reasoning gaps."
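The headline metric—the fraction of predictions falling within 25% relative error of the ground-truth cost—can be computed as below. The exact formulation is an assumption inferred from the reported "accuracy within 25% relative error":

```python
def accuracy_within(preds: list[float], truths: list[float], tol: float = 0.25) -> float:
    """Fraction of predictions whose relative error |p - t| / t is <= tol."""
    hits = sum(abs(p - t) / t <= tol for p, t in zip(preds, truths))
    return hits / len(preds)

# $110 and $80 estimates of a $100 cost pass at 25% tolerance; $200 fails.
print(accuracy_within([110.0, 80.0, 200.0], [100.0, 100.0, 100.0]))  # two of three within tolerance
```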

For developers building chemistry-focused AI systems, the findings highlight that production deployments require robust error handling, validation layers, and human-in-the-loop oversight before automating procurement decisions. The noise-injection framework provides a practical stress-testing methodology applicable to other domains requiring precise procedural grounding.
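A noise-injection stress test of the kind described could perturb reaction text before evaluation and compare accuracy on clean versus noisy inputs. The perturbations below (unit spelling variants, irregular whitespace) are hypothetical examples; the paper's noise types may differ:

```python
def inject_noise(text: str) -> str:
    """Toy input perturbations: a unit spelling variant plus whitespace noise.
    A real harness would parameterize noise type and severity."""
    noisy = text.replace(" g ", " grams ")  # unit spelling variant
    noisy = noisy.replace(" ", "  ")        # irregular whitespace
    return noisy

print(inject_noise("Add 10 g NaCl to 50 mL H2O"))
```

Running the same agent on both versions of each input isolates robustness from base capability, which is what makes the degradation numbers diagnostic.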

Key Takeaways
  • LLMs achieve only 50.6% accuracy on chemical cost estimation despite tool access, showing that tool availability does not ensure competent tool use
  • Stage-level analysis identifies specific failure modes: brittle parsing, weak evidence integration, and invalid pack selection rather than generic reasoning failures
  • Noise-injection testing reveals substantial accuracy degradation under realistic input perturbations, questioning real-world robustness of agent systems
  • ChemCost benchmark provides objective, ground-truth evaluation methodology applicable beyond chemistry to other domain-specific tool-use tasks
  • Production chemistry AI systems require additional validation layers and human oversight before automating procurement decisions