AI | Bearish | Importance 6/10
Who Gets Cited Most? Benchmarking Long-Context Numerical Reasoning on Scientific Articles
AI Summary
Researchers introduced SciTrek, a new benchmark for testing large language models' ability to perform numerical reasoning across long scientific documents. The benchmark reveals significant challenges for current LLMs, with the best model achieving only 46.5% accuracy at 128K tokens, and performance declining as context length increases.
Key Takeaways
- SciTrek benchmark tests LLMs on counting, sorting, and comparing information across multiple full-text scientific articles.
- Even the best-performing LLM achieved only 46.5% exact match accuracy at 128K token contexts.
- Model performance degrades as context length increases, highlighting limitations in long-context reasoning.
- LLMs particularly struggle with citation-related questions and compound logical conditions including negation.
- The benchmark uses SQL queries over article metadata to generate verifiable questions with ground-truth answers.
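
To make the last takeaway concrete, here is a minimal sketch of how a SQL query over article metadata can yield a question with a checkable answer. It is an illustration only, not SciTrek's actual pipeline: the table names, columns, rows, and the `exact_match` normalization (trim whitespace, ignore case) are all assumptions invented for this example.

```python
import sqlite3

# Hypothetical metadata schema and rows, invented for illustration.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE articles (id INTEGER PRIMARY KEY, title TEXT, num_authors INTEGER);
    CREATE TABLE citations (citing_id INTEGER, cited_id INTEGER);
    INSERT INTO articles VALUES (1, 'Paper A', 3), (2, 'Paper B', 5), (3, 'Paper C', 2);
    INSERT INTO citations VALUES (2, 1), (3, 1), (3, 2);
""")

# Question 1: "Which article is cited most often by the other articles?"
MOST_CITED = """
    SELECT a.title
    FROM articles AS a
    JOIN citations AS c ON c.cited_id = a.id
    GROUP BY a.id
    ORDER BY COUNT(*) DESC, a.title ASC
    LIMIT 1
"""
ground_truth = conn.execute(MOST_CITED).fetchone()[0]

# Question 2 (compound condition with negation):
# "How many articles have at least two authors and are NOT cited by any other article?"
UNCITED = """
    SELECT COUNT(*)
    FROM articles
    WHERE num_authors >= 2
      AND id NOT IN (SELECT cited_id FROM citations)
"""
uncited_count = conn.execute(UNCITED).fetchone()[0]


def exact_match(prediction: str, reference: str) -> bool:
    """Assumed scoring rule: equal after trimming whitespace and lowercasing."""
    return prediction.strip().lower() == reference.strip().lower()


print(ground_truth, uncited_count)
```

Because the answer comes from executing the query, a model's response can be scored mechanically, for example with `exact_match(model_answer, ground_truth)`, without human judgment.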
#llm #benchmark #long-context #numerical-reasoning #scientific-articles #performance-evaluation #ai-limitations #context-length #citation-analysis
Read Original: via arXiv (CS AI)