🧠 AI | 🔴 Bearish | Importance 6/10

HEARTS: Benchmarking LLM Reasoning on Health Time Series

arXiv – CS AI | Sirui Li, Shuhan Xiao, Mihir Joshi, Ahmed Metwally, Daniel McDuff, Wei Wang, Yuzhe Yang | March 17, 2026 at 04:00 AM

🤖 AI Summary

Researchers introduce HEARTS, a benchmark for evaluating how well large language models reason over health time series data, spanning 16 datasets and 12 health domains. The study finds that current LLMs substantially underperform specialized models and struggle with multi-step temporal reasoning in healthcare applications.

Key Takeaways

→ HEARTS benchmark integrates 16 real-world datasets across 12 health domains and 20 signal modalities for LLM evaluation.
→ Current state-of-the-art LLMs substantially underperform specialized models on health time series tasks.
→ LLM performance on health data shows weak correlation with general reasoning capabilities.
→ Models struggle with multi-step temporal reasoning and rely on simple heuristics rather than complex analysis.
→ Performance degrades with increasing temporal complexity, suggesting scaling alone is insufficient for healthcare AI.
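
The "weak correlation" takeaway can be made concrete with a rank correlation between two sets of per-model scores. The sketch below is illustrative only: it is not the paper's evaluation code, the choice of Spearman's coefficient is an assumption, and the function names and scores are hypothetical placeholders rather than HEARTS results.

```python
from typing import Sequence


def ranks(values: Sequence[float]) -> list[float]:
    """Return 1-based ranks, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the run while the next value ties with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        average_rank = (start + end) / 2 + 1
        for position in range(start, end + 1):
            result[order[position]] = average_rank
        start = end + 1
    return result


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation of two equal-length sequences with nonzero variance."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman rank correlation: Pearson correlation computed on the ranks."""
    return pearson(ranks(x), ranks(y))


if __name__ == "__main__":
    # Hypothetical placeholder scores for five unnamed models; NOT real results.
    general_reasoning = [0.61, 0.70, 0.74, 0.82, 0.88]
    health_time_series = [0.41, 0.37, 0.45, 0.39, 0.43]
    print(f"Spearman rho = {spearman(general_reasoning, health_time_series):.2f}")
```

A coefficient near 1 would mean that models ranking higher on general reasoning also rank higher on health time series tasks; a value near 0 corresponds to the weak relationship the summary describes.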