AI · Bearish · arXiv – CS AI · 14h ago · 7/10
FinVerBench: Benchmark Validity and Calibration in Large Language Model Financial Statement Verification
Researchers introduced FinVerBench, a benchmark built on real SEC 10-K filings for evaluating how well large language models verify the accuracy of financial statements. Tests of 14 contemporary LLMs revealed critical limitations: most models showed false-positive rates of 95-100% on clean statements, and performance varied dramatically depending on how the financial data was rendered. The results suggest that financial verification requires calibrated judgment, not just the detection of arithmetic errors.