CFAgentBench: A Reproducible Environment and Benchmark for Autonomous Construction-Finance Agents
Researchers introduce CFAgentBench, a comprehensive benchmark for testing autonomous AI agents in construction finance workflows. The benchmark includes 1,014 task specifications across real software tools (ERP, payroll, banking portals) with strict functional grading, revealing that top models achieve only 67% accuracy on single attempts but collapse to 38% when consistency is required.
CFAgentBench addresses a critical gap in AI agent evaluation by moving beyond static benchmarks to test real-world operational competence in high-stakes financial environments. Unlike previous benchmarks that prioritize raw accuracy metrics, this framework implements a money-movement guard: an otherwise correct task execution is graded as a failure if the agent moves money without supervision. This reflects actual deployment requirements, where human approval gates payments and filings. The design directly challenges industry assumptions about agent readiness.
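The guard's grading rule can be illustrated with a minimal sketch. Everything named here is an assumption made for illustration: the action kinds in `MONEY_MOVING`, the `Action` record, and the `grade` function are hypothetical and are not taken from the benchmark's actual harness.

```python
from dataclasses import dataclass

# Hypothetical set of action kinds that count as money movement or filing.
# The benchmark's real list is not given in this summary.
MONEY_MOVING = {"submit_payment", "release_payroll", "file_tax_return"}


@dataclass
class Action:
    kind: str
    human_approved: bool = False  # True if a human signed off before execution


def grade(actions, functional_checks_passed):
    """Pass only if the outcome is correct AND no money moved without approval."""
    for action in actions:
        if action.kind in MONEY_MOVING and not action.human_approved:
            # The guard overrides functional correctness.
            return False
    return functional_checks_passed


# Two traces that reach the same correct end state.
approved_trace = [Action("create_invoice"), Action("submit_payment", human_approved=True)]
unsupervised_trace = [Action("create_invoice"), Action("submit_payment")]
```

Under these assumptions, both traces could satisfy the functional checks, yet only `approved_trace` would be graded as a pass.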
The benchmark's architecture mirrors production construction-finance stacks with 35 mock applications grounded in authentic workflows. The distinction between pass@1 (single attempt) and pass@5 (five attempts at temperature 0) exposes a fundamental reliability problem. Because the pass@5 score is lower than pass@1, it evidently counts a task as solved only when the attempts succeed consistently, rather than when at least one of the five succeeds. The fall from 67% to 38% is a 43% relative drop (29 percentage points) and shows that models cannot consistently replicate correct behaviors, a dealbreaker for financial automation. Heterogeneity in results across domains further suggests that agents lack robust understanding rather than having capability gaps in specific areas.
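The arithmetic behind the two metrics can be sketched as follows. This assumes the consistency score counts a task as solved only when all five attempts succeed; the summary does not spell out the exact definition. The function names and the toy results are hypothetical.

```python
def pass_at_1(results):
    """Average single-attempt success rate across tasks."""
    return sum(sum(r) / len(r) for r in results.values()) / len(results)


def all_k_pass(results):
    """Fraction of tasks where every attempt succeeded (assumed consistency metric)."""
    return sum(all(r) for r in results.values()) / len(results)


def relative_drop(before, after):
    """Relative decline from `before` to `after`, as a fraction of `before`."""
    return (before - after) / before


# Toy data: four tasks, five attempts each (True = attempt passed functional grading).
toy_results = {
    "task_a": [True, True, True, True, True],
    "task_b": [True, True, False, True, True],
    "task_c": [True, False, True, False, True],
    "task_d": [False, False, False, False, False],
}

single = pass_at_1(toy_results)       # per-attempt average
consistent = all_k_pass(toy_results)  # only task_a passes all five
```

With the reported figures, `relative_drop(0.67, 0.38)` comes out at about 0.43, which matches the 43% cited above; the absolute gap is 29 percentage points.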
For AI developers and financial technology companies, these results validate skepticism about deploying large language models in autonomous finance roles without significant safety infrastructure. The public-private split methodology prevents benchmark contamination while enabling reproducible evaluation. The framework's emphasis on functional correctness over LLM-judged quality sets a higher bar than common practice, making reported performance metrics more trustworthy. Construction firms evaluating agent-based automation should recognize that, on this data, published single-attempt accuracies overstate consistent performance by about 29 percentage points (a 43% relative gap), so production use calls for significantly higher raw performance thresholds.
- Top open-weight AI agents achieve only 67% single-attempt accuracy on construction-finance tasks, dropping to 38% when consistency is required.
- CFAgentBench implements a money-movement guard requiring human approval for financial transactions, so that even a correct financial operation counts as a failure if executed autonomously.
- The 43% relative drop (67% to 38%) between single and repeated attempts reveals that current models cannot reliably replicate correct behaviors in complex workflows.
- The benchmark uses 1,014 task specifications grounded in real construction-finance software, making it more representative of actual deployment requirements than prior benchmarks.
- Results indicate that single-attempt accuracy overstates agent readiness for production financial operations by about 29 percentage points, or 43% in relative terms.