AINeutralarXiv – CS AI · 18h ago6/10
🧠
TQA-Bench: Evaluating LLMs for Multi-Table Question Answering
Researchers introduce TQA-Bench, a comprehensive benchmark for evaluating large language models on multi-table question answering tasks using real-world datasets with variable context lengths (8K-64K tokens). The evaluation of LLMs ranging from 2 billion to 671 billion parameters reveals significant performance gaps in handling complex relational data structures, addressing a critical gap in existing benchmarks that focus primarily on single-table QA.