AI · Neutral · arXiv – CS AI · 7h ago · 6/10
BlueFin: Benchmarking LLM Agents on Financial Spreadsheets
BlueFin is a new benchmark that evaluates how well large language model agents perform real-world financial spreadsheet tasks. Its results show that even frontier LLMs struggle with complex spreadsheet manipulation and analysis.