TSAQA: Time Series Analysis Question And Answering Benchmark
Researchers introduce TSAQA, a benchmark for evaluating the time series analysis capabilities of large language models across six tasks and 210k samples. Current LLMs struggle with temporal analysis: even top commercial models reach only about 65% accuracy, revealing substantial gaps in their ability to handle complex time series reasoning.
The introduction of TSAQA addresses a critical gap in AI evaluation frameworks by extending beyond narrow time series tasks to encompass a holistic view of temporal analysis capabilities. Traditional benchmarks have focused primarily on forecasting and anomaly detection, but real-world applications demand multi-faceted analysis including characterization, comparison, transformation, and relationship identification. This expansion mirrors the broader trend in AI evaluation toward more comprehensive, domain-specific assessment tools that reflect practical use cases.
The benchmark's design reflects growing recognition that general-purpose LLMs require specialized evaluation for domain-critical applications. Finance, healthcare, and environmental science rely heavily on accurate time series interpretation, yet current models demonstrate fundamental limitations. The performance gap is stark: Gemini-2.5-Flash, the strongest commercial model tested, achieves only 65.08% accuracy, which suggests that LLMs struggle with temporal reasoning despite their strengths in text processing. The diverse question formats (true-false, multiple-choice, and a novel puzzling format) test different cognitive demands in temporal analysis.
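To make the accuracy figures concrete, here is a minimal sketch of how per-format and overall accuracy could be tallied for a question-answering benchmark of this kind. It is not the TSAQA authors' evaluation code: the record layout (`format`, `prediction`, `answer`), the format labels, the exact-match rule, and the sample records are all assumptions made for illustration, and the puzzling-format answer string is a placeholder because the format is not described here.

```python
from collections import defaultdict


def accuracy_by_format(records):
    """Return {format: fraction correct} plus an 'overall' entry.

    Each record is a dict with hypothetical keys 'format', 'prediction',
    and 'answer'. Scoring is case-insensitive exact match, which is an
    assumption and not necessarily the benchmark's own rule.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for rec in records:
        fmt = rec["format"]
        total[fmt] += 1
        if rec["prediction"].strip().lower() == rec["answer"].strip().lower():
            correct[fmt] += 1

    scores = {fmt: correct[fmt] / total[fmt] for fmt in total}
    n = sum(total.values())
    scores["overall"] = sum(correct.values()) / n if n else 0.0
    return scores


# Made-up records for illustration only; these are not TSAQA samples.
records = [
    {"format": "true_false", "prediction": "True", "answer": "true"},
    {"format": "true_false", "prediction": "False", "answer": "true"},
    {"format": "multiple_choice", "prediction": "B", "answer": "B"},
    {"format": "puzzling", "prediction": "3,1,2", "answer": "1,2,3"},
]

scores = accuracy_by_format(records)
print(scores)
```

Reporting a score per format alongside the overall number helps show whether a model's weakness is concentrated in one question type or spread across all of them.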
This work carries significant implications for organizations deploying LLMs in domains that depend on time series data. The results suggest that financial institutions, healthcare providers, and climate researchers cannot rely on current models for critical temporal analysis without substantial fine-tuning. They also underline the need for continued LLM development focused on temporal reasoning and domain-specific optimization. The gains from instruction tuning point to a viable path forward, yet the persistent gap indicates that fundamental architectural improvements may be necessary before enterprise deployment in time series applications.
- The TSAQA benchmark covers six time series analysis tasks across 13 domains with 210k samples, significantly expanding beyond existing narrow benchmarks.
- Top commercial LLMs achieve only 65% accuracy on time series analysis despite their general capabilities, revealing critical limitations in temporal reasoning.
- Instruction tuning improves open-source model performance but still leaves substantial room for advancement in time series question answering tasks.
- The benchmark employs diverse question formats, including a novel puzzling format, designed to comprehensively assess temporal analysis capabilities.
- Current performance gaps suggest that LLM deployment in finance, healthcare, and environmental applications requires specialized fine-tuning for reliable temporal analysis.