AINeutralarXiv – CS AI · Apr 146/10
🧠
FinTrace: Holistic Trajectory-Level Evaluation of LLM Tool Calling for Long-Horizon Financial Tasks
Researchers introduced FinTrace, a benchmark dataset with 800 expert-annotated trajectories for evaluating how large language models perform financial tool-calling tasks. The study reveals that while frontier LLMs excel at selecting appropriate tools, they struggle significantly with information utilization and generating accurate final outputs, pointing to a critical reasoning gap that persists even after fine-tuning with preference optimization techniques.