AI · Neutral · arXiv – CS AI · 18h ago · 6/10
MBABench: Evaluating LLM Agents on End-to-End Spreadsheet Tasks in Finance
Researchers introduced MBABench, a new evaluation framework for testing LLM agents on end-to-end financial spreadsheet tasks, a capability that enterprises increasingly demand but that existing benchmarks do not adequately measure. The study found that even top-performing models such as Claude fall short of professional finance standards: they struggle with complex multi-step workflows, and their output quality degrades sharply as task difficulty increases.