AI · Bearish · arXiv – CS AI · 8h ago · 7/10
Measurement Risk in Supervised Financial NLP: Rubric and Metric Sensitivity on JF-ICR
Researchers show that supervised financial NLP benchmarks used to evaluate LLMs carry hidden measurement risk: rubric wording, metric selection, and aggregation method can each materially alter model performance rankings. Tests on the Japanese Financial Implicit-Commitment Recognition (JF-ICR) dataset show agreement varying by 13 points across rubric variants and indicate that certain metrics produce unreliable signals, underscoring the need for standardized evaluation governance in financial AI model selection.
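As a rough illustration of how metric selection alone can reorder models, the sketch below scores two hypothetical classifiers on a small, imbalanced toy label set. The data, the model names, and the helper functions `accuracy`, `macro_f1`, and `rank_by` are all assumptions made for this example. None of it comes from JF-ICR or from the paper's evaluation code.

```python
# Toy example: the same predictions ranked under two different metrics.
# All labels and "models" here are invented for illustration only.

def accuracy(y_true, y_pred):
    """Fraction of items where the prediction matches the gold label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1, computed over the gold label set."""
    scores = []
    for label in sorted(set(y_true)):
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        # F1 = 2*TP / (2*TP + FP + FN); defined as 0 when the class is never hit.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)

# Imbalanced gold labels: eight items of class 0, two of class 1.
y_true = [0] * 8 + [1] * 2

predictions = {
    # Hypothetical model that always predicts the majority class.
    "model_a": [0] * 10,
    # Hypothetical model that finds both minority items but over-predicts class 1.
    "model_b": [0] * 5 + [1] * 3 + [1] * 2,
}

def rank_by(metric):
    """Return model names sorted from best to worst under the given metric."""
    return sorted(predictions, key=lambda m: metric(y_true, predictions[m]), reverse=True)

if __name__ == "__main__":
    for metric in (accuracy, macro_f1):
        scores = {m: round(metric(y_true, p), 3) for m, p in predictions.items()}
        print(metric.__name__, scores, "ranking:", rank_by(metric))
```

On this toy data, accuracy places the majority-class model first while macro-F1 places it last. A model-selection decision would therefore flip depending on which metric the benchmark reports.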