AI · Bearish · arXiv – CS AI · 6h ago · 6/10
TxBench-PP: Analyzing AI Agent Performance on Small-Molecule Preclinical Pharmacology
Researchers introduced TxBench-PP, a benchmark that tests whether AI agents can analyze real-world drug discovery data rather than recall memorized information. Across 11 AI models and 4,800 trajectories, the results showed significant limitations: even the best-performing system (Claude Opus) succeeded only 59% of the time on preclinical pharmacology tasks, suggesting that AI agents need substantial improvement before they can be reliably deployed in drug discovery workflows.
🧠 GPT-5 · 🧠 Claude · 🧠 Opus