AI · Bearish · arXiv – CS AI · 3h ago · 6/10
Reading or Guessing? Visual Grounding Failures of Vision-Language Models for OCR in Ancient Greek Editions
Researchers show that Vision-Language Models (VLMs) used for optical character recognition (OCR) produce fluent but visually unsupported text, relying heavily on language priors rather than on the actual image content. In tests on Ancient Greek critical editions, VLMs generate plausible-looking errors, whereas traditional OCR produces local noise. A token-level grounding analysis shows that vulnerability to hallucination is model-specific.