#document-understanding News & Analysis

10 articles tagged with #document-understanding. AI-curated summaries with sentiment analysis and key takeaways from 50+ sources.

AI · Bullish · arXiv – CS AI · Apr 15 · 7/10

DocSeeker: Structured Visual Reasoning with Evidence Grounding for Long Document Understanding

Researchers introduce DocSeeker, a multimodal AI system that improves long document understanding through a structured workflow of analysis, localization, and reasoning. The work targets a known weakness of existing large language models, which struggle with lengthy documents because of high noise levels and weak training signals, and reports superior performance on both short and ultra-long documents.

AI · Bullish · arXiv – CS AI · Apr 10 · 7/10

Q-Zoom: Query-Aware Adaptive Perception for Efficient Multimodal Large Language Models

Q-Zoom is a new framework that improves the efficiency of multimodal large language models when they process high-resolution visual inputs. Using adaptive query-aware perception, the system achieves 2.5-4.4x faster inference on document and high-resolution tasks while maintaining or exceeding baseline accuracy across multiple MLLM architectures.

AI · Bullish · arXiv – CS AI · Mar 5 · 6/10

Index-Preserving Lightweight Token Pruning for Efficient Document Understanding in Vision-Language Models

Researchers have developed a lightweight token pruning framework that reduces computational costs for vision-language models in document understanding tasks by filtering out non-informative background regions before processing. The approach uses a binary patch-level classifier and max-pooling refinement to maintain accuracy while substantially lowering compute demands.
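The pruning idea in this summary can be illustrated with a minimal sketch. It is not the paper's implementation: the function name `prune_patches`, the score grid, the 0.5 threshold, and the 3x3 window are all assumptions chosen for illustration. It assumes a classifier has already produced one foreground score per patch, thresholds those scores into a binary mask, applies a stride-1 max-pool so that neighbors of kept patches also survive, and returns the surviving patches' original row-major indices so that positional information is preserved.

```python
def prune_patches(scores, threshold=0.5, pool=3):
    """Return original flat indices of patches to keep.

    scores: 2D list of per-patch foreground scores in [0, 1]
            (hypothetical output of a binary patch classifier).
    threshold, pool: illustrative values, not taken from the paper.
    """
    rows, cols = len(scores), len(scores[0])

    # Step 1: binary patch-level decision (informative vs. background).
    mask = [[s >= threshold for s in row] for row in scores]

    # Step 2: max-pooling refinement (stride 1, window clipped at borders).
    # A patch is kept if any patch inside its window was kept.
    r = pool // 2
    refined = [
        [
            any(
                mask[i][j]
                for i in range(max(0, y - r), min(rows, y + r + 1))
                for j in range(max(0, x - r), min(cols, x + r + 1))
            )
            for x in range(cols)
        ]
        for y in range(rows)
    ]

    # Step 3: keep the original row-major index of every surviving patch,
    # so downstream position embeddings can still refer to true locations.
    return [y * cols + x for y in range(rows) for x in range(cols) if refined[y][x]]


# Usage: a 4x4 grid where only the top-left patch looks informative.
grid = [
    [0.9, 0.1, 0.1, 0.1],
    [0.1, 0.1, 0.1, 0.1],
    [0.1, 0.1, 0.1, 0.1],
    [0.1, 0.1, 0.1, 0.1],
]
kept = prune_patches(grid)
```

In this toy grid the refinement step widens the single confident patch to its immediate neighbors, which is one plausible way such a step could protect text that sits on patch boundaries.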

AI · Neutral · arXiv – CS AI · Jun 25 · 6/10

HG-Bench: A Benchmark for Multi-Page Handwritten Answer-Region Grounding in Automated Homework Assessment

Researchers introduce HG-Bench, a benchmark dataset of 500 annotated homework samples for evaluating how well automated grading systems locate and decompose handwritten student answers across multiple pages. Current AI models, including frontier VLMs, achieve less than 55% accuracy on complete answer localization, revealing a significant gap in spatial reasoning over handwritten documents.

AI · Neutral · arXiv – CS AI · Jun 9 · 6/10

TABVERSE: Benchmarking Cross-Format Table Understanding in LLMs and VLMs

Researchers introduced TABVERSE, a new benchmark for evaluating how Large Language Models and Vision-Language Models understand tables across different formats (HTML, Markdown, LaTeX, and images). The study reveals that table representation significantly impacts model performance, with structured text formats generally outperforming rendered images, though performance varies by task and model type.
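To make "the same table in different formats" concrete, here is a small sketch that serializes one table into three of the text formats named above. The helper names `to_markdown`, `to_html`, and `to_latex` and the sample data are hypothetical and are not part of the benchmark; the image format is omitted because it would require a rendering library.

```python
from html import escape


def to_markdown(header, rows):
    """Serialize a table as a pipe-delimited Markdown table (no cell escaping)."""
    lines = [
        "| " + " | ".join(header) + " |",
        "| " + " | ".join("---" for _ in header) + " |",
    ]
    lines += ["| " + " | ".join(r) + " |" for r in rows]
    return "\n".join(lines)


def to_html(header, rows):
    """Serialize a table as a minimal HTML <table>, escaping cell text."""
    head = "".join(f"<th>{escape(h)}</th>" for h in header)
    body = "".join(
        "<tr>" + "".join(f"<td>{escape(c)}</td>" for c in r) + "</tr>" for r in rows
    )
    return f"<table><tr>{head}</tr>{body}</table>"


def to_latex(header, rows):
    """Serialize a table as a LaTeX tabular (special characters not escaped)."""
    spec = "l" * len(header)
    lines = [f"\\begin{{tabular}}{{{spec}}}", " & ".join(header) + " \\\\", "\\hline"]
    lines += [" & ".join(r) + " \\\\" for r in rows]
    lines.append("\\end{tabular}")
    return "\n".join(lines)


# Usage: one logical table, three surface forms a model could be prompted with.
header = ["model", "score"]
rows = [["A", "0.71"], ["B", "0.64"]]
markdown_table = to_markdown(header, rows)
html_table = to_html(header, rows)
latex_table = to_latex(header, rows)
```

All three strings carry identical cell content, so any difference in a model's answers across them would come from the representation alone.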

AI · Neutral · arXiv – CS AI · May 29 · 6/10

MPDocBench-Parse: Benchmarking Practical Multi-page Document Parsing

Researchers introduce MPDocBench-Parse, a new benchmark dataset for evaluating multi-page document parsing systems across realistic, complex scenarios. The benchmark comprises 433 manually annotated documents spanning 3,246 pages in 15 document types, revealing that existing AI models excel at basic text extraction but struggle with semantic continuity, visual content preservation, and hierarchical structure recovery.

AI · Neutral · arXiv – CS AI · May 27 · 6/10

Doc-CoB: Enhancing Document Understanding with Visual Chain-of-Boxes Reasoning

Researchers introduce Doc-CoB, a new framework that improves how AI models understand documents by progressively focusing on relevant layout regions while maintaining global context. The approach combines coarse-to-fine visual reasoning with multimodal large language models and demonstrates significant performance improvements across seven benchmarks.

AI · Neutral · arXiv – CS AI · May 11 · 6/10

LensVLM: Selective Context Expansion for Compressed Visual Representation of Text

LensVLM is a new inference framework that enables Vision Language Models to process highly compressed images of text by selectively expanding relevant sections, achieving 4.3x compression while maintaining accuracy comparable to full-resolution processing. The approach combines learned tool selection with post-training techniques to overcome the fundamental limitation that compressed text becomes illegible to standard vision encoders.

AI · Neutral · arXiv – CS AI · May 1 · 6/10

COHERENCE: Benchmarking Fine-Grained Image-Text Alignment in Interleaved Multimodal Contexts

Researchers introduce COHERENCE, a new benchmark for evaluating how well Multimodal Large Language Models (MLLMs) understand fine-grained image-text alignment in interleaved contexts, such as documents with mixed text and images. The benchmark contains 6,161 high-quality questions across four domains and includes error analysis to identify specific capability gaps in current models.

AI · Bullish · arXiv – CS AI · Apr 13 · 6/10

VISOR: Agentic Visual Retrieval-Augmented Generation via Iterative Search and Over-horizon Reasoning

Researchers introduce VISOR, a new agentic visual retrieval-augmented generation system that improves how AI models reason over multi-page visual documents. By addressing key technical challenges in evidence gathering and context management, VISOR achieves state-of-the-art results on complex visual reasoning tasks.