FinRAG-12B: A Production-Validated Recipe for Grounded Question Answering in Banking
Researchers present FinRAG-12B, a 12-billion-parameter language model optimized for banking applications. It matches GPT-4.1 on citation grounding while maintaining safer refusal rates and operating at 20-50x lower cost, and it is already deployed across 40+ financial institutions, where it delivered a statistically significant 7.1 percentage point improvement in query resolution.
FinRAG-12B addresses a critical gap in AI adoption within highly regulated industries where accuracy, explainability, and cost efficiency are non-negotiable requirements. Traditional large language models like GPT-4.1 struggle with banking's dual demands: they either over-refuse questions to avoid errors or generate unsupported claims, making them unsuitable for customer-facing financial applications. This work demonstrates that domain-specific optimization through careful data curation and calibrated training yields superior outcomes across multiple dimensions simultaneously.
The banking sector's resistance to LLM adoption stems from legitimate concerns about regulatory compliance, hallucination risk, and the inability to audit model reasoning. FinRAG-12B addresses these problems through three innovations: a data-efficient pipeline that uses LLM-as-Judge filtering to distill training data down to only 143M tokens, a calibrated refusal mechanism that maintains a 12% "I don't know" rate versus GPT-4.1's excessive 20.2%, and an end-to-end deployment methodology that ensures production readiness. The model achieves this while outperforming GPT-4.1 specifically on citation grounding: the ability to cite the source documents that support its answers.
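The LLM-as-Judge filtering step can be sketched roughly as follows. To keep the example self-contained, the judge model is replaced by a trivial lexical-overlap proxy, and the function names and 0.8 threshold are illustrative assumptions, not the paper's actual pipeline:

```python
# Hypothetical sketch of an LLM-as-Judge data filter. In a real pipeline,
# judge_example would prompt a judge LLM to score groundedness; here a
# lexical-overlap proxy stands in so the sketch runs on its own.

def judge_example(question: str, answer: str, context: str) -> float:
    """Score how well an answer is supported by its context, on a 0-1 scale."""
    answer_terms = set(answer.lower().split())
    context_terms = set(context.lower().split())
    if not answer_terms:
        return 0.0
    return len(answer_terms & context_terms) / len(answer_terms)

def filter_corpus(examples, threshold=0.8):
    """Keep only (question, answer, context) triples the judge rates as grounded."""
    return [ex for ex in examples if judge_example(*ex) >= threshold]
```

Filtering the corpus this way is what lets a small, well-grounded training set (the reported 143M tokens) stand in for a much larger noisy one.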
The real-world validation across 40+ institutions represents substantial market validation. A 7.1 percentage point improvement in query resolution translates to measurable customer satisfaction gains and operational efficiency. The 3-5x speed advantage and dramatic cost reduction make widespread deployment economically viable for institutions previously unable to justify LLM adoption. This establishes a template for domain-specific AI development in finance, potentially accelerating LLM adoption across banking, insurance, and compliance functions where similar requirements exist.
- FinRAG-12B achieves GPT-4.1-level citation grounding performance while operating 20-50x cheaper and 3-5x faster
- The model maintains a calibrated 12% refusal rate, substantially safer than base models' 4.3% while avoiding GPT-4.1's 20.2% over-refusal
- Data-efficient training on just 143M tokens enables high performance on domain-specific tasks through LLM-as-Judge filtering and curriculum learning
- Production deployment across 40+ financial institutions achieved a statistically significant 7.1 percentage point improvement in query resolution
- Success demonstrates the viability of domain-specific LLM optimization for highly regulated industries prioritizing accuracy and explainability over raw capability
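The calibrated refusal behavior described above can be approximated by a simple confidence cutoff tuned on held-out data so that the overall refusal rate lands near the target 12%. The sketch below is a hypothetical illustration; the confidence scores and thresholding scheme are assumptions, not the paper's calibration method:

```python
# Hypothetical sketch of refusal-rate calibration: choose a confidence
# cutoff on a held-out set so the model refuses roughly the target
# fraction of questions (12% matches the reported deployed behavior).

def refusal_threshold(confidences, target_refusal_rate=0.12):
    """Return the cutoff below which answers are refused.

    `confidences` is a list of per-question confidence scores from a
    calibration set; how those scores are produced is left unspecified.
    """
    ranked = sorted(confidences)
    k = round(len(ranked) * target_refusal_rate)
    return ranked[k]  # scores below this cutoff trigger a refusal

def answer_or_refuse(confidence, threshold):
    """Answer when confident enough, otherwise refuse."""
    return "answer" if confidence >= threshold else "I don't know"
```

Tuning the cutoff on held-out data is what keeps the refusal rate between the base models' too-permissive 4.3% and GPT-4.1's over-cautious 20.2%.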