arXiv, CS AI
HEARTS: Benchmarking LLM Reasoning on Health Time Series
Researchers introduce HEARTS, a benchmark for evaluating how well large language models reason over health time series data, spanning 16 datasets and 12 health domains. The study finds that current LLMs significantly underperform specialized models and struggle with multi-step temporal reasoning in healthcare applications.