AI · Neutral · arXiv - CS AI · 14h ago · 6/10
FinTrace: Holistic Trajectory-Level Evaluation of LLM Tool Calling for Long-Horizon Financial Tasks
Researchers introduced FinTrace, a benchmark dataset with 800 expert-annotated trajectories for evaluating how large language models perform financial tool-calling tasks. The study reveals that while frontier LLMs excel at selecting appropriate tools, they struggle significantly with information utilization and generating accurate final outputs, pointing to a critical reasoning gap that persists even after fine-tuning with preference optimization techniques.