Beyond 'One Language, One Script': Quantifying Orthographic Bias in Multilingual VLMs with PuMVR
Researchers introduce PuMVR, a benchmark revealing significant script-dependent bias in multilingual Vision-Language Models (VLMs): the same visual reasoning tasks produce accuracy gaps of up to 16% depending on the writing system used. The study shows that current VLMs do not handle the different scripts of a multi-script language like Punjabi equally, undermining claims of true multilingual capability and highlighting inequities in AI development.
The research addresses a critical blind spot in multilingual AI evaluation. While Vision-Language Models have posted impressive benchmark results across languages, they operate under the simplifying assumption that each language maps to exactly one script. For the billions who use multi-script languages, such as Punjabi speakers switching between Gurmukhi, Shahmukhi, and Roman scripts, this assumption creates fractured capabilities that undermine practical utility.
This work emerges from a growing recognition that multilingual AI benchmarks often miss crucial dimensions of real-world language use. Previous evaluations typically test a single script per language, creating a statistical illusion of capability. PuMVR's 375 culturally grounded tasks across Punjabi's three scripts show that models reach Script Consistency Rates as low as 24.8%, meaning identical reasoning tasks fail when presented in different orthographies.
For AI developers and organizations deploying VLMs in multilingual markets, this research signals that performance claims require deeper scrutiny. The finding that visual input boosts absolute accuracy but does not close the relative gaps between scripts suggests the problem operates at the representation level, rather than being something additional training data alone would solve.
The proposed Script Consistency Rate metric provides a concrete tool for more equitable evaluation. As AI systems increasingly serve global populations, accounting for orthographic variation becomes essential infrastructure. Future model development in multilingual spaces will likely need to explicitly address script variation rather than treating it as a trivial implementation detail, reshaping how companies benchmark and compare capabilities across non-Latin writing systems.
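To make the two headline numbers concrete, here is a minimal Python sketch of how a per-script accuracy gap and a Script Consistency Rate could be computed from evaluation results. The summary above does not give the paper's exact formula, so the definition used here (the share of tasks a model answers correctly in every script) is an assumption, as are the function names, the record layout, and the toy data.

```python
from typing import Dict, List

# Hypothetical record layout (not from the paper): results[task_id][script]
# is True when the model answered that task correctly in that script.
Results = Dict[str, Dict[str, bool]]

SCRIPTS: List[str] = ["gurmukhi", "shahmukhi", "roman"]


def per_script_accuracy(results: Results, scripts: List[str]) -> Dict[str, float]:
    """Fraction of tasks answered correctly, computed separately per script."""
    if not results:
        raise ValueError("results must contain at least one task")
    n = len(results)
    return {s: sum(r[s] for r in results.values()) / n for s in scripts}


def max_accuracy_gap(accuracy: Dict[str, float]) -> float:
    """Difference between the best-served and worst-served script."""
    return max(accuracy.values()) - min(accuracy.values())


def script_consistency_rate(results: Results, scripts: List[str]) -> float:
    """ASSUMED definition: share of tasks answered correctly in all scripts."""
    if not results:
        raise ValueError("results must contain at least one task")
    consistent = sum(all(r[s] for s in scripts) for r in results.values())
    return consistent / len(results)


# Toy data, invented purely for illustration; these are not PuMVR results.
toy: Results = {
    "t1": {"gurmukhi": True, "shahmukhi": True, "roman": True},
    "t2": {"gurmukhi": True, "shahmukhi": False, "roman": True},
    "t3": {"gurmukhi": True, "shahmukhi": False, "roman": False},
    "t4": {"gurmukhi": False, "shahmukhi": False, "roman": False},
}

toy_accuracy = per_script_accuracy(toy, SCRIPTS)
toy_gap = max_accuracy_gap(toy_accuracy)
toy_scr = script_consistency_rate(toy, SCRIPTS)
```

In the toy data the model looks reasonable when judged on Gurmukhi alone, yet only one of four tasks is solved in all three scripts. That is the kind of mismatch a single-script benchmark would hide. A stricter or looser definition (for example, counting a task as consistent whenever the outcome is the same across scripts, right or wrong) would give different values, so the paper's own definition should be checked before comparing numbers.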
- Vision-Language Models show accuracy gaps of up to 16% on identical visual reasoning tasks depending on script, exposing critical script-dependent bias.
- Current multilingual benchmarks miss orthographic variation, creating statistical illusions of capability that fail in real-world deployment.
- Visual input improves absolute performance but fails to close relative script bias, indicating representation-level rather than data-level problems.
- The Script Consistency Rate metric is proposed as a new standard for equitable multilingual AI evaluation.
- Billions of multi-script language users face fractured model capabilities that current evaluation paradigms systematically overlook.