OCR or Not? Rethinking Document Information Extraction in the MLLMs Era with Real-World Large-Scale Datasets
A large-scale benchmarking study finds that powerful Multimodal Large Language Models (MLLMs) can extract information from business documents using image-only input, potentially eliminating the need for traditional OCR preprocessing. The research demonstrates that well-designed prompts and instructions can further enhance MLLM performance in document processing tasks.