MULTITEXTEDIT: Benchmarking Cross-Lingual Degradation in Text-in-Image Editing
Researchers introduce MULTITEXTEDIT, a benchmark for evaluating text-in-image editing across 12 languages, revealing significant cross-lingual performance degradation in AI models. The study uncovers pronounced accuracy issues in non-English languages, particularly Hebrew and Arabic, highlighting the need for multilingual improvements in visual content creation AI.
MULTITEXTEDIT addresses a critical gap in AI evaluation methodology by shifting focus from English-dominant benchmarks to genuinely multilingual assessment. The benchmark's innovation lies not merely in scale but in methodological rigor: pairing language variants with shared visual bases isolates linguistic factors from visual ones, enabling precise cross-lingual comparison. The introduction of Language Script Fidelity (LSF), a specialized metric capturing script-specific errors, reflects the recognition that semantic correctness differs fundamentally from visual rendering accuracy. This distinction matters because dropped diacritics, broken right-to-left (RTL) text order, and mixed-script rendering are failure modes that coarse text-matching metrics such as BLEU cannot see.
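To make that distinction concrete, here is a minimal, hypothetical sketch (not the paper's LSF definition): a coarse text-matching check that normalizes away diacritics can report a successful edit, while a script-level check on the same pair shows that every combining mark was dropped. The function names, the scoring proxy, and the Hebrew example are illustrative assumptions.

```python
import unicodedata

def coarse_text_match(expected: str, rendered: str) -> bool:
    # Coarse matching that folds away combining marks, roughly the kind of
    # normalization text-matching pipelines apply; it cannot see dropped diacritics.
    strip = lambda s: "".join(
        c for c in unicodedata.normalize("NFD", s) if not unicodedata.combining(c)
    )
    return strip(expected).strip() == strip(rendered).strip()

def script_fidelity_proxy(expected: str, rendered: str) -> float:
    # Hypothetical script-level proxy (not the paper's LSF metric): the fraction
    # of combining marks (e.g., Hebrew niqqud, Arabic harakat) in the expected
    # text that survive in the rendered text.
    marks = lambda s: {
        c for c in unicodedata.normalize("NFD", s) if unicodedata.combining(c)
    }
    expected_marks = marks(expected)
    if not expected_marks:
        return 1.0
    return len(expected_marks & marks(rendered)) / len(expected_marks)

# Hebrew target with niqqud vs. a rendering that silently drops the diacritics.
expected = "שָׁלוֹם"
rendered = "שלום"
print(coarse_text_match(expected, rendered))      # True  -> looks "correct"
print(script_fidelity_proxy(expected, rendered))  # 0.0   -> script marks lost
```

The point of the toy example is that both strings contain the same base letters, so a normalization-heavy metric scores the edit as correct even though the script-specific marks were lost entirely.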
The research contextualizes a broader industry problem: as text-in-image editing becomes central to content creation workflows, models trained predominantly on English data systematically underperform on typologically diverse languages. The pronounced degradation on Hebrew and Arabic suggests that script complexity and linguistic distance from training data correlate with failure rates. This gap has immediate practical implications for global users and developers building multilingual applications.
For the AI development community, these findings expose a critical blind spot in model evaluation practices. Teams benchmarking on English metrics may unknowingly deploy systems with severe functional limitations for non-English markets. The performance variance across languages—from robust Spanish/Dutch handling to degraded Hebrew/Arabic rendering—suggests that current architectural approaches lack language-agnostic robustness. Developers and research teams will likely prioritize multilingual fine-tuning and script-aware training protocols. The work establishes evaluation standards that future models must meet, effectively raising baseline expectations for production-grade systems serving global audiences.
- All 12 evaluated models show cross-lingual performance degradation, with Hebrew and Arabic experiencing the largest accuracy drops.
- The LSF metric captures script-level errors that traditional text-matching metrics miss, achieving 0.76 quadratic-weighted kappa agreement with human annotators (see the agreement sketch after this list).
- Text accuracy and script fidelity emerge as primary failure modes rather than structural or layout issues, revealing script-specific vulnerabilities.
- The benchmark spans 3,600 instances across 12 typologically diverse languages and 5 visual domains, establishing rigorous multilingual evaluation standards.
- Models preserve global visual properties while distorting script-specific forms, indicating systematic encoding failures rather than random degradation.
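As a point of reference for the 0.76 figure above, quadratic-weighted kappa is a standard agreement statistic for ordinal ratings, and the sketch below shows how such agreement is typically computed with scikit-learn's cohen_kappa_score. The 0-3 rating scale and the score arrays are made-up stand-ins, not data from the benchmark.

```python
from sklearn.metrics import cohen_kappa_score

# Hypothetical ordinal fidelity ratings (0 = illegible, 3 = fully correct)
# assigned by the automatic metric and by a human annotator to the same items.
metric_scores = [3, 2, 0, 1, 3, 2, 1, 0, 3, 2]
human_scores  = [3, 2, 1, 1, 3, 1, 1, 0, 3, 2]

# Quadratic weighting penalizes large disagreements (e.g., 0 vs. 3) more
# heavily than adjacent ones (e.g., 2 vs. 3), which suits ordinal scales.
kappa = cohen_kappa_score(metric_scores, human_scores, weights="quadratic")
print(f"Quadratic-weighted kappa: {kappa:.2f}")
```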