AI · Bullish · arXiv – CS AI · 8h ago · 6/10
When Single Answer Is Not Enough: Rethinking Single-Step Retrosynthesis Benchmarks for LLMs
Researchers propose a new benchmarking framework for evaluating large language models on single-step retrosynthesis. It introduces ChemCensor, a metric that prioritizes chemical plausibility over exact-match accuracy, and CREED, a dataset of millions of validated reaction records that improves model performance beyond existing LLM baselines.