arXiv CS AI · 4d ago
Who Gets Cited Most? Benchmarking Long-Context Numerical Reasoning on Scientific Articles
Researchers introduced SciTrek, a new benchmark for testing large language models' ability to perform numerical reasoning across long scientific documents. The benchmark reveals significant challenges for current LLMs, with the best model achieving only 46.5% accuracy at 128K tokens, and performance declining as context length increases.