Measurement Risk in Supervised Financial NLP: Rubric and Metric Sensitivity on JF-ICR
Researchers demonstrate that supervised financial NLP benchmarks used to evaluate LLMs contain hidden measurement risks, where rubric wording, metric selection, and aggregation methods materially alter model performance rankings. Testing on the Japanese Financial Implicit-Commitment Recognition dataset reveals a 13-percentage-point spread in label agreement across rubric variants and shows that certain metrics produce unreliable signals, highlighting the need for standardized evaluation governance in financial AI model selection.
Large language models are increasingly deployed to analyze financial documents—earnings calls, investor guidance, and regulatory disclosures—with supervised benchmarks serving as the primary evidence for model selection. This research exposes a critical vulnerability: the assumption that gold-standard labeled datasets produce objective evaluation results breaks down when the measurement instrument itself is unstable.
The study systematically varies rubric wording, metrics, and temperature settings across four frontier LLMs on a 253-item Japanese financial dataset. The findings are sobering. Rubric variants produced label agreement ranging from just 70% to 83%, with most disagreement clustering around boundary cases where implicit-commitment status is ambiguous. This suggests that how evaluators frame the task, not just underlying model capability, drives performance scores. When evaluators then choose different metrics (within-one accuracy, worst-class accuracy, exact accuracy, macro-F1, weighted kappa), they can reach contradictory conclusions about which model ranks highest, even on identical predictions.
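To make that metric-disagreement point concrete, the sketch below scores two hypothetical models against the same gold labels with the five metric families named above. The labels, predictions, and helper functions are illustrative assumptions, not JF-ICR data or the paper's code.

```python
# A minimal illustration, not the paper's code: two hypothetical models,
# one fixed set of gold labels, five metrics, and no single "best" model.
import numpy as np
from sklearn.metrics import accuracy_score, f1_score, cohen_kappa_score

# Ordinal labels 0/1/2 (e.g., no / implicit / explicit commitment); all values are made up.
y_true  = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 2])
model_a = np.array([0, 0, 0, 0, 1, 1, 1, 1, 1, 2])  # covers the rare class, more small errors
model_b = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1])  # high exact accuracy, misses the rare class

def within_one(y, p):
    """Fraction of predictions within one ordinal step of the gold label."""
    return float(np.mean(np.abs(y - p) <= 1))

def worst_class_acc(y, p):
    """Recall of the single worst-served class."""
    return min(float(np.mean(p[y == c] == c)) for c in np.unique(y))

for name, pred in [("model_a", model_a), ("model_b", model_b)]:
    print(name,
          f"exact={accuracy_score(y_true, pred):.2f}",
          f"macro_f1={f1_score(y_true, pred, average='macro', zero_division=0):.2f}",
          f"qwk={cohen_kappa_score(y_true, pred, weights='quadratic'):.2f}",
          f"within1={within_one(y_true, pred):.2f}",
          f"worst={worst_class_acc(y_true, pred):.2f}")
# With these toy numbers, model_b leads on exact accuracy while model_a
# leads on macro-F1 and worst-class accuracy: same gold labels, opposite verdicts.
```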
For financial institutions and AI teams selecting models for production use, this measurement risk has material consequences. A model ranked second under one metric-rubric combination could rank first under another, potentially swinging deployment decisions worth millions. The research demonstrates that Bradley-Terry, Borda, and Ranked Pairs ranking algorithms agree only when aggregation is restricted to statistically robust metrics, but diverge when all five metrics are included.
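As a rough illustration of why the admitted metric set matters for aggregation, the Borda-count sketch below uses hypothetical scores for three models; Bradley-Terry and Ranked Pairs are omitted, and none of the numbers come from the paper.

```python
# Hypothetical per-metric scores and a plain Borda count; the paper's actual
# scores and its Bradley-Terry / Ranked Pairs aggregations are not reproduced here.
from collections import defaultdict

scores = {  # metric -> {model: score}, all numbers invented for illustration
    "exact_acc":      {"A": 0.78, "B": 0.81, "C": 0.74},
    "macro_f1":       {"A": 0.72, "B": 0.70, "C": 0.66},
    "weighted_kappa": {"A": 0.64, "B": 0.61, "C": 0.55},
    "within_one":     {"A": 0.93, "B": 0.97, "C": 0.98},  # near-saturated, weak signal
    "worst_class":    {"A": 0.10, "B": 0.35, "C": 0.05},  # noisy on rare classes
}

def borda_ranking(metric_names):
    """Award 0 points to the worst model per metric, 1 to the next, etc., then sum."""
    points = defaultdict(int)
    for m in metric_names:
        for pts, model in enumerate(sorted(scores[m], key=scores[m].get)):
            points[model] += pts
    return sorted(points.items(), key=lambda kv: -kv[1])

print("all five metrics:", borda_ranking(list(scores)))
print("robust subset:  ", borda_ranking(["exact_acc", "macro_f1", "weighted_kappa"]))
# With these toy scores, model B tops the all-metric aggregate while model A
# tops the robust subset: the admitted metric set decides the "winner".
```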
The contribution shifts focus from leaderboard construction to evaluation governance. The authors advocate for explicit reporting of rubric design choices, metric justifications, and aggregation logic—transforming financial NLP benchmarking from a black-box comparison into a documented, auditable process. This discipline becomes essential as LLMs gain credibility in high-stakes financial decision-making.
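One way to picture that governance discipline is a machine-readable evaluation manifest that travels with every reported score. The structure below is a hypothetical sketch, not a schema proposed by the authors; field names and values are illustrative.

```python
# Hypothetical evaluation manifest; field names and values are illustrative,
# not the authors' reporting format.
evaluation_manifest = {
    "dataset": "JF-ICR (253 items)",
    "rubric": {
        "version": "v2",  # which wording variant was scored
        "variants_reported": "all tested variants, with agreement ranges",
    },
    "metrics": {
        "reported": ["exact_accuracy", "macro_f1", "quadratic_weighted_kappa"],
        "excluded": {
            "worst_class_accuracy": "unreliable under the benchmark's class distribution",
            "within_one_accuracy": "unreliable signal in this benchmark",
        },
    },
    "aggregation": {"methods": ["Bradley-Terry", "Borda", "Ranked Pairs"]},
    "decoding": {"temperature": "varied as a robustness check"},
}
```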
- Rubric wording variations produced 13-point swings in model label agreement on financial data, indicating measurement instability rather than true capability differences.
- Not all evaluation metrics remain valid under real-world class distributions; worst-class and within-one accuracy metrics generated unreliable signals in this benchmark.
- Model rankings diverged based on metric selection alone, with consensus only emerging after identifying statistically robust metric subsets.
- Gold-labeled benchmarks require explicit governance policies on rubric design, metric selection, and aggregation to produce defensible model comparisons.
- Financial institutions using NLP benchmarks for model selection should demand transparency on evaluation methodology, not just final leaderboard positions.