AI · Neutral · arXiv – CS AI · 9h ago · 6/10
Can Agents Price a Reaction? Evaluating LLMs on Chemical Cost Reasoning
Researchers introduce ChemCost, a benchmark for evaluating LLM agents on estimating chemical costs from reaction descriptions. The study finds that even frontier LLMs reach only 50.6% accuracy on clean inputs and degrade significantly when realistic noise is added. The agents have access to domain-specific tools, yet the results expose brittleness in parsing, evidence integration, and tool use.