
Who Gets Cited Most? Benchmarking Long-Context Numerical Reasoning on Scientific Articles

arXiv – CS AI | Miao Li, Alexander Gurung, Irina Saparina, Mirella Lapata
🤖 AI Summary

Researchers introduced SciTrek, a new benchmark for testing large language models' ability to perform numerical reasoning across long scientific documents. The benchmark reveals significant challenges for current LLMs, with the best model achieving only 46.5% accuracy at 128K tokens, and performance declining as context length increases.

Key Takeaways
  • SciTrek is a benchmark that tests LLMs on counting, sorting, and comparing information across multiple full-text scientific articles.
  • Even the best-performing LLM achieved only 46.5% exact-match accuracy at 128K-token contexts.
  • Model performance degrades as context length increases, highlighting limitations in long-context reasoning.
  • LLMs particularly struggle with citation-related questions and compound logical conditions involving negation.
  • The benchmark uses SQL queries over article metadata to generate verifiable questions with ground-truth answers.
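The last point, generating verifiable questions from SQL over metadata, can be sketched roughly as follows. This is a minimal illustration, not SciTrek's actual pipeline: the table schema, column names, and sample data are all assumptions made for the example.

```python
import sqlite3

# Hypothetical metadata schema (an assumption; SciTrek's real schema is not shown in the summary).
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE articles (id INTEGER PRIMARY KEY, title TEXT);
CREATE TABLE citations (citing_id INTEGER, cited_id INTEGER);
""")
conn.executemany("INSERT INTO articles VALUES (?, ?)",
                 [(1, "Paper A"), (2, "Paper B"), (3, "Paper C")])
conn.executemany("INSERT INTO citations VALUES (?, ?)",
                 [(1, 3), (2, 3), (1, 2)])

# A counting-style question ("Which article is cited most often?") expressed as SQL.
# Executing it yields a ground-truth answer against which an LLM's free-text
# response over the full articles can be checked by exact match.
row = conn.execute("""
    SELECT a.title, COUNT(*) AS n
    FROM citations c JOIN articles a ON a.id = c.cited_id
    GROUP BY c.cited_id
    ORDER BY n DESC
    LIMIT 1
""").fetchone()
print(row)  # -> ('Paper C', 2)
```

Because the answer comes from executing a query rather than from human annotation, the same template scales to arbitrarily long contexts simply by adding more articles to the tables.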