AI · Bearish · arXiv – CS AI · 3h ago · 7/10
Benchmarks are Not Enough: RAMP for Runtime Assessing of Agentic Models in Production Systems
Researchers introduce RAMP, a production-grounded assessment framework that reveals significant performance degradation in LLM agents under real-world conditions, with task completion rates collapsing from 100% to 20% across serial workflows. Testing 15 mainstream models shows that traditional benchmarks mask critical failures in long-horizon execution chains, while computational costs vary by three orders of magnitude between comparable models.