FinVerBench: Benchmark Validity and Calibration in Large Language Model Financial Statement Verification
Researchers introduced FinVerBench, a benchmark for evaluating how well large language models verify the accuracy of financial statements drawn from real SEC 10-K filings. Tests of 14 contemporary LLMs revealed critical limitations: most models flagged 95-100% of clean statements as erroneous, and measured performance varied dramatically with how the financial data was rendered. The results suggest that financial verification requires calibrated judgment, not just arithmetic error detection.
FinVerBench addresses a gap in LLM evaluation by testing whether state-of-the-art language models can perform a task that requires logical consistency and numerical reasoning: verifying corporate financial statements. The benchmark is built from real SEC XBRL filings and covers four error categories (arithmetic, cross-statement linkage, year-over-year changes, and magnitude perturbations), which provides a realistic testing ground that grows in importance as enterprises consider AI for financial analysis tasks.
The results expose a significant vulnerability in current LLMs: nine of the fourteen tested models failed on clean statements, with false-positive rates exceeding 95%. This suggests that models may be hallucinating verification conclusions rather than performing genuine validation. Rendering choices in the benchmark also materially affected results: recall dropped from 100% on unrounded data to 79% on realistic rounded data, which suggests that models are brittle and sensitive to formatting rather than building a robust logical understanding.
This distinction matters for industry applications. Financial institutions evaluating LLMs for compliance, audit, or analytical functions cannot rely on current models as primary verification tools without human oversight. The research suggests that financial statement verification demands more than pattern recognition; it requires handling incomplete information, prompt constraints, and real-world numerical conventions. For AI developers, the findings underscore that benchmark design choices can mask fundamental capability gaps. The publicly available benchmark lets the community track progress, but current results indicate that substantial work remains before LLMs achieve reliable, production-quality financial verification.
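To make the rounding sensitivity concrete, the following minimal Python sketch shows how a statement that is exactly consistent in raw dollars can appear inconsistent once each figure is rounded to millions. The line items, the `round_to_millions` and `sums_to_total` helpers, and the tolerance rule are all hypothetical illustrations, not taken from FinVerBench itself.

```python
from decimal import Decimal, ROUND_HALF_UP


def round_to_millions(value: int) -> int:
    """Round a dollar amount to whole millions (half up), as a filing might display it."""
    millions = Decimal(value) / Decimal(1_000_000)
    return int(millions.quantize(Decimal("1"), rounding=ROUND_HALF_UP))


def sums_to_total(components, total, tolerance=0):
    """Check whether the components add up to the total within a tolerance."""
    return abs(sum(components) - total) <= tolerance


# Hypothetical line items in raw dollars (not real filing data).
components = [1_400_000, 1_400_000, 1_400_000]
total = sum(components)  # 4,200,000: consistent by construction

# On unrounded data an exact check passes.
exact_ok = sums_to_total(components, total)

# After rounding, each component displays as 1 and the total as 4.
rounded_components = [round_to_millions(c) for c in components]
rounded_total = round_to_millions(total)

# A strict check now reports a mismatch (3 != 4) on a clean statement.
strict_ok = sums_to_total(rounded_components, rounded_total)

# Assumed tolerance rule: every displayed figure may be off by up to 0.5 units.
tolerance = 0.5 * (len(rounded_components) + 1)
tolerant_ok = sums_to_total(rounded_components, rounded_total, tolerance)

print(exact_ok, strict_ok, tolerant_ok)
```

A strict check raises a false alarm on the rounded statement, while a tolerant check accepts it at the cost of letting through real errors smaller than the tolerance. That is one plausible reason why recall on rounded data could fall below recall on unrounded data.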
- Nine of the fourteen tested LLMs showed false-positive rates of 95-100% when verifying clean financial statements, indicating fundamental reasoning gaps rather than minor errors.
- Benchmark rendering choices significantly affect measured performance, suggesting models are sensitive to formatting rather than grounded in the underlying financial logic.
- The best-performing model achieved 0% false positives but only 79% recall on realistic rounded data, illustrating the trade-off between avoiding false alarms and catching every error.
- Financial statement verification requires calibrated judgment under incomplete information, not just arithmetic error detection.
- FinVerBench provides the first standardized evaluation framework for this critical financial AI use case using real SEC 10-K data.
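The two metrics behind these takeaways can be sketched in a few lines of Python. The function names and the toy verdict lists below are assumptions chosen to mirror the percentages quoted above; they are not the benchmark's actual data or scoring code.

```python
def false_positive_rate(flags_on_clean):
    """Share of clean statements that the model wrongly flagged as erroneous."""
    if not flags_on_clean:
        raise ValueError("need at least one clean statement")
    return sum(flags_on_clean) / len(flags_on_clean)


def recall(flags_on_erroneous):
    """Share of statements with a known error that the model flagged."""
    if not flags_on_erroneous:
        raise ValueError("need at least one erroneous statement")
    return sum(flags_on_erroneous) / len(flags_on_erroneous)


# Hypothetical model A flags every statement it sees.
always_flag_clean = [True] * 100
always_flag_errors = [True] * 100

# Hypothetical model B never flags a clean statement and catches 79 of 100 errors.
calibrated_clean = [False] * 100
calibrated_errors = [True] * 79 + [False] * 21

fpr_a = false_positive_rate(always_flag_clean)
recall_a = recall(always_flag_errors)
fpr_b = false_positive_rate(calibrated_clean)
recall_b = recall(calibrated_errors)

print(f"A: FPR={fpr_a:.2f}, recall={recall_a:.2f}")
print(f"B: FPR={fpr_b:.2f}, recall={recall_b:.2f}")
```

Model A reaches perfect recall only because it flags everything, so recall alone says little about verification quality. Reporting the false-positive rate on clean statements alongside recall is what separates an indiscriminate flagger from a calibrated verifier.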