Researchers propose AMREC, a new agentic framework that improves text-guided molecular generation by shifting focus from merely fixing invalid chemical structures to preserving target-relevant molecular identity. The approach outperforms existing correction strategies by combining molecule-aware tracking with expanded candidate exploration, achieving superior recovery across multiple evaluation metrics on invalid molecular drafts.
Generating valid molecules from text descriptions is a critical bottleneck in computational chemistry and drug discovery. Large language models frequently produce invalid SMILES strings (SMILES is the textual notation for chemical structures), and the post-hoc corrections this requires often distort key structural features or introduce unintended modifications. AMREC addresses this by reframing the problem from a validity-centric repair task to an identity-preserving recovery objective, recognizing that restoring chemical validity means nothing if the recovered molecule no longer matches the target description.
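The gap between validity and identity can be illustrated with a small sketch. Everything here is an assumption made for illustration: `is_balanced` is a toy syntactic check (balanced parentheses and paired single-digit ring closures), not a real chemical validity test of the kind a toolkit such as RDKit performs, and Levenshtein distance stands in for a string-level metric. The example strings are hypothetical and are not taken from the AMREC evaluation.

```python
def is_balanced(smiles: str) -> bool:
    """Toy check: parentheses balance and ring-closure digits pair up.

    Ignores bracket atoms, charges, and %nn ring closures, so it is only
    a rough stand-in for real validity checking.
    """
    depth = 0
    open_rings = set()
    for ch in smiles:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:
                return False
        elif ch.isdigit():
            open_rings ^= {ch}  # first occurrence opens a ring, second closes it
    return depth == 0 and not open_rings


def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # delete from a
                           cur[j - 1] + 1,             # insert into a
                           prev[j - 1] + (ca != cb)))  # substitute
        prev = cur
    return prev[-1]


target = "CC(=O)Oc1ccccc1C(=O)O"           # aspirin, the intended molecule
draft = "CC(=O)Oc1ccccc1C(=O"              # hypothetical invalid draft (unclosed branch)
repair_truncate = "CC(=O)O"                # passes the check, but most of the molecule is gone
repair_preserve = "CC(=O)Oc1ccccc1C(=O)O"  # passes the check and keeps the identity

for name, s in [("draft", draft),
                ("truncate", repair_truncate),
                ("preserve", repair_preserve)]:
    print(name, is_balanced(s), levenshtein(s, target))
```

Both repairs pass the toy check, yet only one of them is still the target molecule, which is the distinction the identity-preserving objective is meant to capture.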
This advancement builds on broader efforts to integrate LLMs with molecular design pipelines. Previous approaches treated invalid outputs as simple errors to be corrected, either through specialized repair modules or by feeding the invalid output back to the LLM for correction. Each came at a cost: post-hoc repairs restored validity but damaged crucial molecular features, LLM-only corrections caused unpredictable structural drift, and even sophisticated agentic systems equipped with RDKit tools remained limited to greedy, single-candidate exploration.
AMREC's contribution lies in coupling molecule-aware mismatch tracking (keeping track of which structural elements matter) with expanded trajectory exploration, and in selecting among whole correction trajectories rather than among individual edits. Testing on ChEBI-20 invalid drafts across three backbone models shows measurable improvements in structural, exact-match, and string-level metrics. This work matters for computational chemistry and drug discovery pipelines, where combining LLM reasoning with chemical validity constraints could accelerate molecule discovery. The approach demonstrates how domain-aware constraints can improve AI system outputs beyond what generic correction strategies achieve.
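The difference between greedy single-candidate correction and trajectory-level selection can be sketched on a toy search problem. This is a minimal illustration under stated assumptions, not AMREC's implementation: the state names, the `EDITS` table, and the `SCORES` values are all hypothetical, with the score standing in for whatever combination of validity and mismatch tracking a real system would use.

```python
from typing import Callable, Dict, List

# Hypothetical edit graph: each state maps to the candidate edits available from it.
EDITS: Dict[str, List[str]] = {
    "draft": ["A1", "B1"],
    "A1": ["A2"],
    "B1": ["B2"],
}
# Hypothetical quality scores (higher means closer to the intended molecule).
SCORES: Dict[str, float] = {"draft": 0.0, "A1": 0.6, "A2": 0.5, "B1": 0.4, "B2": 0.9}


def propose(state: str) -> List[str]:
    return EDITS.get(state, [])


def score(state: str) -> float:
    return SCORES[state]


def greedy_recover(start: str, propose: Callable[[str], List[str]],
                   score: Callable[[str], float], steps: int) -> str:
    """Commit to the best-looking single candidate at every step."""
    current = start
    for _ in range(steps):
        candidates = propose(current)
        if not candidates:
            break
        current = max(candidates, key=score)
    return current


def trajectory_recover(start: str, propose: Callable[[str], List[str]],
                       score: Callable[[str], float], steps: int) -> List[str]:
    """Expand every candidate, then select among whole trajectories by final score."""
    trajectories = [[start]]
    for _ in range(steps):
        expanded = []
        for traj in trajectories:
            nxt = propose(traj[-1])
            if nxt:
                expanded.extend(traj + [c] for c in nxt)
            else:
                expanded.append(traj)  # keep trajectories with no further edits
        trajectories = expanded
    return max(trajectories, key=lambda t: score(t[-1]))


greedy_result = greedy_recover("draft", propose, score, steps=2)
best_trajectory = trajectory_recover("draft", propose, score, steps=2)
print(greedy_result)    # A2
print(best_trajectory)  # ['draft', 'B1', 'B2']
```

In this toy graph the greedy strategy takes the edit that looks best first and ends at a lower final score, while keeping both branches alive lets the selection step pick the trajectory that ends best. Exhaustive expansion is used only to keep the sketch short; a practical system would bound the number of trajectories it keeps.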
- AMREC shifts molecular correction from validity-repair to identity-preserving recovery, maintaining structural integrity while fixing chemical validity.
- Traditional post-hoc repair and LLM-only correction approaches create tradeoffs between validity and structural fidelity that AMREC addresses.
- The framework couples molecule-aware mismatch tracking with expanded candidate exploration, improving performance across multiple evaluation metrics.
- Results demonstrate measurable advantages on ChEBI-20 invalid molecular drafts across three different backbone models.
- Domain-specific constraints and trajectory-level selection outperform greedy single-candidate approaches in agentic molecular correction.