y0news

#document-understanding News & Analysis

8 articles tagged with #document-understanding. AI-curated summaries with sentiment analysis and key takeaways from 50+ sources.

AI · Bullish · arXiv – CS AI · Apr 15 · 7/10

DocSeeker: Structured Visual Reasoning with Evidence Grounding for Long Document Understanding

Researchers introduce DocSeeker, a multimodal AI system that improves long document understanding through a structured workflow of analysis, localization, and reasoning. It targets a limitation of existing large language models, which struggle with lengthy documents because of high noise levels and weak training signals, and it reports superior performance on both short and ultra-long documents.

AI · Bullish · arXiv – CS AI · Apr 10 · 7/10

Q-Zoom: Query-Aware Adaptive Perception for Efficient Multimodal Large Language Models

Q-Zoom is a new framework that improves the efficiency of multimodal large language models (MLLMs) on high-resolution visual inputs. Using query-aware adaptive perception, it achieves 2.5x to 4.4x faster inference on document and high-resolution tasks while matching or exceeding baseline accuracy across multiple MLLM architectures.

AI · Bullish · arXiv – CS AI · Mar 5 · 6/10

Index-Preserving Lightweight Token Pruning for Efficient Document Understanding in Vision-Language Models

Researchers have developed a lightweight token pruning framework that reduces computational costs for vision-language models in document understanding tasks by filtering out non-informative background regions before processing. The approach uses a binary patch-level classifier and max-pooling refinement to maintain accuracy while substantially lowering compute demands.
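
As a rough illustration of the idea summarized above, the sketch below dilates a binary patch mask with a stride-1 max-pool and then reads off the original patch indices of the kept patches. It is a minimal sketch under stated assumptions, not the paper's implementation: the function names `refine_mask` and `kept_indices`, the 3x3 window, and the row-major index layout are all hypothetical choices made for this example.

```python
from typing import List

Mask = List[List[int]]  # 1 = keep patch, 0 = background (assumed convention)


def refine_mask(mask: Mask, k: int = 3) -> Mask:
    """Max-pool a binary patch mask with a k x k window, stride 1.

    Any patch whose neighbourhood contains a kept patch is also kept,
    which pads the classifier's output around detected content.
    """
    h, w = len(mask), len(mask[0])
    r = k // 2
    out = [[0] * w for _ in range(h)]
    for i in range(h):
        for j in range(w):
            # Clip the window at the grid border instead of padding.
            out[i][j] = max(
                mask[y][x]
                for y in range(max(0, i - r), min(h, i + r + 1))
                for x in range(max(0, j - r), min(w, j + r + 1))
            )
    return out


def kept_indices(mask: Mask) -> List[int]:
    """Return row-major indices of kept patches, preserving original positions."""
    w = len(mask[0])
    return [i * w + j for i, row in enumerate(mask) for j, v in enumerate(row) if v]


# Hypothetical 4x4 patch grid with a single patch flagged as informative.
raw = [
    [0, 0, 0, 0],
    [0, 1, 0, 0],
    [0, 0, 0, 0],
    [0, 0, 0, 0],
]
refined = refine_mask(raw)
print(kept_indices(refined))  # [0, 1, 2, 4, 5, 6, 8, 9, 10]
```

Keeping the original indices, rather than renumbering the surviving patches, is one way a downstream model could retain each patch's position in the page after pruning.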

AI · Neutral · arXiv – CS AI · May 29 · 6/10

MPDocBench-Parse: Benchmarking Practical Multi-page Document Parsing

Researchers introduce MPDocBench-Parse, a new benchmark dataset for evaluating multi-page document parsing systems across realistic, complex scenarios. The benchmark comprises 433 manually annotated documents spanning 3,246 pages in 15 document types, revealing that existing AI models excel at basic text extraction but struggle with semantic continuity, visual content preservation, and hierarchical structure recovery.

AI · Neutral · arXiv – CS AI · May 27 · 6/10

Doc-CoB: Enhancing Document Understanding with Visual Chain-of-Boxes Reasoning

Researchers introduce Doc-CoB, a new framework that improves how AI models understand documents by progressively focusing on relevant layout regions while maintaining global context. The approach combines coarse-to-fine visual reasoning with multimodal large language models and demonstrates significant performance improvements across seven benchmarks.

AI · Neutral · arXiv – CS AI · May 11 · 6/10

LensVLM: Selective Context Expansion for Compressed Visual Representation of Text

LensVLM is a new inference framework that enables Vision Language Models to process highly compressed images of text by selectively expanding relevant sections, achieving 4.3x compression while maintaining accuracy comparable to full-resolution processing. The approach combines learned tool selection with post-training techniques to overcome the fundamental limitation that compressed text becomes illegible to standard vision encoders.

AI · Neutral · arXiv – CS AI · May 1 · 6/10

COHERENCE: Benchmarking Fine-Grained Image-Text Alignment in Interleaved Multimodal Contexts

Researchers introduce COHERENCE, a new benchmark for evaluating Multimodal Large Language Models (MLLMs) on fine-grained image-text alignment in interleaved contexts, such as documents with mixed text and images. The benchmark contains 6,161 high-quality questions across four domains and includes error analysis to identify specific capability gaps in current models.