BankerToolBench: Evaluating AI Agents in End-to-End Investment Banking Workflows
Researchers introduced BankerToolBench (BTB), an open-source benchmark, developed with 502 professional bankers, that evaluates AI agents on end-to-end investment banking workflows. Testing nine frontier models revealed that even the best performer (GPT-5.4) fails nearly half of the evaluation criteria, with zero outputs rated client-ready, highlighting significant gaps in AI readiness for high-stakes professional work.
BankerToolBench represents a crucial shift in AI evaluation methodology, moving beyond generic benchmarks toward profession-specific, economically grounded assessment frameworks. The benchmark's development with 502 investment bankers from leading firms ensures ecological validity: the tasks reflect actual workflows rather than synthetic problems. Each task replicates real work that junior bankers spend up to 21 hours completing, creating meaningful economic stakes that tie measured AI capability directly to business value.
This research reflects broader recognition that current AI benchmarks fail to capture the complexity of professional environments. Generic language model benchmarks don't assess an agent's ability to maintain consistency across multiple output formats (Excel models, PowerPoint decks, PDF reports), navigate proprietary data systems, or meet stakeholder quality standards. The 100+ rubric criteria developed by veteran bankers represent nuanced professional judgment that automated metrics typically miss.
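The paper does not publish the benchmark's evaluation schema, but a minimal sketch helps make rubric-based grading concrete. The `RubricCriterion` class, its field names, and the example criteria below are illustrative assumptions, not BTB's actual format; the point is simply that each banker-written requirement is judged per artifact and aggregated into a pass rate.

```python
from dataclasses import dataclass

# Hypothetical sketch of rubric-based grading; field names and
# example criteria are illustrative, not BankerToolBench's schema.
@dataclass
class RubricCriterion:
    description: str   # banker-written quality requirement
    artifact: str      # which deliverable it applies to
    passed: bool       # judgment from a human or LLM grader

def pass_rate(criteria: list[RubricCriterion]) -> float:
    """Fraction of banker-defined criteria an agent's output satisfies."""
    return sum(c.passed for c in criteria) / len(criteria)

graded = [
    RubricCriterion("EBITDA bridge ties to the model", "excel", True),
    RubricCriterion("Deck figures match the model", "pptx", False),
    RubricCriterion("Comps set matches the mandate", "pdf", True),
]
print(f"pass rate: {pass_rate(graded):.0%}")  # -> pass rate: 67%
```

Under this framing, "fails nearly half of evaluation criteria" means a pass rate just above 0.5 across 100+ such checks, which is why no single deliverable clears the client-ready bar.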
The testing results carry substantial implications for enterprise AI deployment timelines. GPT-5.4's failure on nearly 50% of criteria despite being a frontier model signals that human-in-the-loop workflows remain necessary for high-liability professional services. Banks cannot delegate complex analytical work to AI without significant oversight, contradicting earlier narratives about imminent autonomous professional agents.
The detailed failure analysis identifying cross-artifact consistency breakdowns provides actionable guidance for model developers. These findings will likely influence investment banking firms' AI adoption strategies, pushing them toward narrower, more controlled applications rather than end-to-end workflow automation. The benchmark itself becomes valuable infrastructure for the AI development community, enabling systematic progress measurement against professional standards rather than arbitrary metrics.
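To illustrate what a cross-artifact consistency check might look like, the sketch below compares key figures drawn from an Excel model against the same figures cited in a slide deck. The extraction step, figure names, and tolerance are assumptions for illustration; the paper does not specify BTB's actual checking mechanism.

```python
import math

# Hypothetical cross-artifact consistency check: in practice the figures
# would be parsed from the agent's Excel model and PowerPoint deck; they
# are hard-coded here to keep the sketch self-contained.
model_figures = {"revenue_2025": 412.3, "ebitda_margin": 0.184}
deck_figures = {"revenue_2025": 412.3, "ebitda_margin": 0.19}

def find_inconsistencies(a: dict, b: dict, rel_tol: float = 1e-3) -> list[str]:
    """Return keys present in both artifacts whose values disagree."""
    return [k for k in a.keys() & b.keys()
            if not math.isclose(a[k], b[k], rel_tol=rel_tol)]

print(find_inconsistencies(model_figures, deck_figures))
# -> ['ebitda_margin']: the deck restates a model figure incorrectly
```

Even a small discrepancy of this kind would fail a banker's review, which is why cross-artifact breakdowns dominate the reported failure modes.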
- GPT-5.4, the best-performing model tested, failed nearly 50% of banker-defined quality criteria with zero client-ready outputs.
- BankerToolBench establishes profession-specific evaluation standards combining task execution, deliverable quality, and stakeholder utility metrics (a hypothetical aggregation sketch follows this list).
- Individual tasks requiring up to 21 hours of banker work demonstrate the significant economic value at stake in professional AI adoption.
- Cross-artifact consistency failures represent a key technical obstacle limiting current AI agents' applicability to complex professional workflows.
- The benchmark's development with 502 practicing bankers ensures evaluation criteria reflect genuine professional standards rather than academic assumptions.
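As noted above, BTB combines task execution, deliverable quality, and stakeholder utility. The weighted aggregation below is a hypothetical illustration of how such axes could roll up into one score; the 0-1 axis scores and equal weights are assumptions, since the paper does not disclose BTB's actual weighting.

```python
# Hypothetical composite score over the three evaluation axes named in the
# benchmark; the 0-1 axis scores and equal weights are illustrative only.
def composite_score(task_execution: float,
                    deliverable_quality: float,
                    stakeholder_utility: float,
                    weights: tuple[float, float, float] = (1/3, 1/3, 1/3)) -> float:
    axes = (task_execution, deliverable_quality, stakeholder_utility)
    return sum(w * s for w, s in zip(weights, axes))

# An agent that executes tools well but produces weak deliverables
print(f"{composite_score(0.9, 0.4, 0.5):.2f}")  # -> 0.60
```

A composite like this makes the headline result interpretable: an agent can score well on raw tool execution and still fall far short once deliverable quality and stakeholder utility are weighed in.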