🧠 AI | 🔴 Bearish | Importance 7/10

From Accuracy to Auditability: A Survey of Determinism in Financial AI Systems

arXiv – CS AI | Ruizhe Zhou, Xiaoyang Liu, Gaoyuan Du, Yi Zheng, Shouxi Ren, Deepayan Chakrabarti, Dengdu Jiang | May 28, 2026 at 04:00 AM

🤖 AI Summary

A comprehensive survey finds that machine learning systems deployed in regulated financial sectors (credit risk, fraud detection, and anti-money laundering) suffer from reproducibility failures caused by hardware-level nondeterminism in neural networks and generative AI. The research quantifies specific vulnerabilities across tabular models, graph networks, and LLM-based workflows, and proposes a layered evaluation framework to improve auditability in financial AI systems.

Analysis

Financial institutions face a critical gap between algorithmic accuracy and regulatory auditability. While traditional statistical ML addressed backtest overfitting, modern deep learning and generative AI introduce mechanical nondeterminism—inconsistent outputs from identical inputs due to hardware parallelization, stochastic sampling, and asynchronous operations. This creates compliance friction in credit scoring, fraud detection, and AML systems where regulators increasingly demand explainability and reproducibility.
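One mechanism behind the hardware-parallelization effect described above is that floating-point addition is not associative, so a reduction that sums the same numbers in a different order can return a different result. The sketch below illustrates only that arithmetic fact in plain Python. The values and the helper name `sum_in_order` are illustrative assumptions, not taken from the survey.

```python
def sum_in_order(values):
    """Accumulate left to right, as a single sequential thread would."""
    total = 0.0
    for v in values:
        total += v
    return total


# The same four numbers, summed in two different orders.
# A parallel reduction may effectively pick either order from run to run.
sequential = sum_in_order([1e16, 1.0, -1e16, 1.0])
reordered = sum_in_order([1e16, -1e16, 1.0, 1.0])

print(sequential)  # 1.0: the first 1.0 is lost when added to 1e16
print(reordered)   # 2.0: the large terms cancel before the small ones are added
```

On real accelerators the discrepancy is usually far smaller than in this contrived case, but it can still be enough to change a score that sits near a decision threshold.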

The research identifies three failure modes across AI modalities. Tabular models show explanation rank instability, meaning the features a credit scoring model highlights can shift between runs. Graph neural networks exhibit prediction flip rates in fraud detection due to stochastic sampling during inference. LLM-based entity extraction diverges based on batch processing configurations, introducing trajectory drift that complicates audit trails.
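To make the idea of a prediction flip rate concrete, here is a minimal sketch under one plausible definition: the fraction of identical inputs whose predicted label differs between two inference runs. This summary does not give the survey's exact definition, so the function name `prediction_flip_rate` and the sample labels are assumptions for illustration only.

```python
def prediction_flip_rate(run_a, run_b):
    """Fraction of positions where two runs over the same inputs disagree.

    Assumes run_a[i] and run_b[i] are the predicted labels for the same input.
    """
    if len(run_a) != len(run_b):
        raise ValueError("both runs must cover the same inputs")
    if not run_a:
        return 0.0
    flips = sum(1 for a, b in zip(run_a, run_b) if a != b)
    return flips / len(run_a)


# Hypothetical fraud labels (1 = flagged) from two runs on the same five transactions.
first_run = [1, 0, 0, 1, 0]
second_run = [1, 0, 1, 1, 0]

print(prediction_flip_rate(first_run, second_run))  # 0.2: one of five labels flipped
```

A rate of zero across repeated runs is what an auditor would expect from a deterministic pipeline; any nonzero value identifies decisions that cannot be reproduced exactly.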

Regulatory pressure compounds this challenge. Financial institutions cannot deploy systems whose decisions they cannot fully explain to regulators or reproduce on demand. Institutions using black-box models for high-stakes decisions face potential enforcement action, and the inability to reproduce audit explanations undermines liability defenses. The result is economic pressure to either constrain AI deployments to simpler, deterministic methods or invest heavily in reproducibility infrastructure.

The proposed layered evaluation framework, which combines modality-specific metrics such as RBO (rank-biased overlap) and D_cos (cosine distance) with logit-level and semantic-level determinism measures, offers a path forward. Financial institutions must prioritize reproducibility engineering alongside model accuracy, treating determinism as a first-class requirement rather than an afterthought. That priority shapes vendor selection and infrastructure choices, and it gives a competitive advantage to institutions that solve reproducibility early.
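The two named metrics can be sketched in a few lines. The code below is a minimal illustration, not the survey's implementation. It assumes the extrapolated form of rank-biased overlap for two equal-length rankings without duplicates, with persistence parameter `p`, and takes cosine distance as one minus cosine similarity. The function names, the default `p=0.9`, and the sample feature rankings are all hypothetical.

```python
import math


def rbo_extrapolated(rank_a, rank_b, p=0.9):
    """Extrapolated rank-biased overlap for two equal-length rankings.

    Returns 1.0 for identical rankings and 0.0 for disjoint ones.
    Assumes no item appears twice within a ranking.
    """
    if len(rank_a) != len(rank_b) or not rank_a:
        raise ValueError("rankings must be non-empty and of equal length")
    k = len(rank_a)
    weighted_sum = 0.0
    overlap = 0
    for d in range(1, k + 1):
        # Number of items shared by the two top-d prefixes.
        overlap = len(set(rank_a[:d]) & set(rank_b[:d]))
        weighted_sum += (overlap / d) * p ** d
    return (overlap / k) * p ** k + (1 - p) / p * weighted_sum


def cosine_distance(u, v):
    """One minus the cosine similarity of two non-zero vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    if norm_u == 0 or norm_v == 0:
        raise ValueError("cosine distance is undefined for zero vectors")
    return 1.0 - dot / (norm_u * norm_v)


# Hypothetical feature-importance rankings from two runs of the same credit model.
run_1 = ["income", "debt_ratio", "age"]
run_2 = ["debt_ratio", "income", "age"]
stability = rbo_extrapolated(run_1, run_2)  # below 1.0 because the top two swapped
```

Because RBO weights the top of the ranking most heavily, a swap among the leading features lowers the score more than the same swap further down, which matches the audit concern that the headline reasons for a credit decision should not change between runs.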

Key Takeaways

→ Modern financial AI systems exhibit hardware-induced nondeterminism that creates reproducibility failures in credit scoring, fraud detection, and AML workflows.
→ Explanation instability in neural networks and trajectory drift in LLM agents pose significant regulatory and compliance risks for financial institutions.
→ Determinism must be measured at multiple levels, both logit-level and semantic-level, to properly assess audit readiness across different AI modalities.
→ Graph neural networks for fraud detection show measurable prediction flip rates between inference runs, undermining confidence in decision consistency.
→ Financial institutions cannot achieve true regulatory compliance without treating reproducibility as a core requirement alongside model accuracy and fairness.