AI | Bearish | Importance: 6/10
Who Gets Cited Most? Benchmarking Long-Context Numerical Reasoning on Scientific Articles
AI Summary
Researchers introduced SciTrek, a new benchmark for testing large language models' ability to perform numerical reasoning across long scientific documents. The benchmark reveals significant challenges for current LLMs, with the best model achieving only 46.5% accuracy at 128K tokens, and performance declining as context length increases.
Key Takeaways
- The SciTrek benchmark tests LLMs on counting, sorting, and comparing information across multiple full-text scientific articles.
- Even the best-performing LLM achieved only 46.5% exact-match accuracy at 128K-token contexts.
- Model performance degrades as context length increases, highlighting limitations in long-context reasoning.
- LLMs particularly struggle with citation-related questions and compound logical conditions involving negation.
- The benchmark uses SQL queries over article metadata to generate verifiable questions with ground-truth answers.
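The last point can be illustrated with a small sketch. SciTrek's actual schema and query set are not given in this summary; assuming hypothetical tables `articles` and `citations`, a question like "Which article is cited most?" can be generated alongside a verifiable ground-truth answer by running an aggregation query over the metadata:

```python
import sqlite3

# Toy metadata in an in-memory database. The table names and columns here
# are illustrative assumptions, not SciTrek's real schema.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE articles (id INTEGER PRIMARY KEY, title TEXT);
CREATE TABLE citations (citing_id INTEGER, cited_id INTEGER);
INSERT INTO articles VALUES (1, 'Paper A'), (2, 'Paper B'), (3, 'Paper C');
INSERT INTO citations VALUES (1, 2), (1, 3), (3, 2);
""")

# Ground-truth answer for "Which article is cited most?" computed by SQL,
# so an LLM's free-text answer can be checked by exact match.
row = conn.execute("""
    SELECT a.title, COUNT(*) AS n_citations
    FROM citations c
    JOIN articles a ON a.id = c.cited_id
    GROUP BY c.cited_id
    ORDER BY n_citations DESC
    LIMIT 1
""").fetchone()
print(row)  # ('Paper B', 2)
```

Because the answer comes from a deterministic query rather than human annotation, the same machinery scales to counting, sorting, and comparison questions over many articles.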
#llm #benchmark #long-context #numerical-reasoning #scientific-articles #performance-evaluation #ai-limitations #context-length #citation-analysis
Read Original via arXiv (CS AI)