AI · Bearish · arXiv – CS AI · 6h ago · 7/10
Flaws in the LLM Automation Narrative
A new benchmarking study challenges the widespread narrative that large language models perform at expert level on knowledge work tasks. By measuring variance and error magnitude alongside accuracy, the researchers found that human experts outperformed frontier LLMs on a data analysis coding task, a result indicating that standard benchmarks fail to capture reliability and consistency, both critical factors for high-stakes applications.
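To make the distinction concrete, here is a minimal Python sketch of how two performers can share the same accuracy yet differ sharply in variance and error magnitude. It is not the study's actual protocol: the `summarize` helper, the `tolerance` threshold, and the error values are all hypothetical, chosen only for illustration.

```python
from statistics import mean, pvariance


def summarize(errors, tolerance):
    """Summarize per-attempt absolute errors (hypothetical helper).

    An attempt counts as correct when its error is within `tolerance`.
    """
    # Fraction of attempts that fall within the tolerance.
    accuracy = mean(1.0 if e <= tolerance else 0.0 for e in errors)
    # Errors of the attempts that were counted as wrong.
    misses = [e for e in errors if e > tolerance]
    return {
        "accuracy": accuracy,
        "error_variance": pvariance(errors),
        "mean_miss_magnitude": mean(misses) if misses else 0.0,
    }


# Toy, made-up error values; NOT data from the study.
steady = [0.0, 0.0, 0.0, 2.0, 0.0]    # one small miss
erratic = [0.0, 0.0, 0.0, 50.0, 0.0]  # one very large miss

steady_stats = summarize(steady, tolerance=1.0)
erratic_stats = summarize(erratic, tolerance=1.0)

print(steady_stats)
print(erratic_stats)
```

Both toy performers get four of five attempts right, so an accuracy-only benchmark would rank them equally. The variance and miss-magnitude fields are what separate them, and that is the kind of gap the summary says standard benchmarks overlook.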