
The Translation Tax Is Not a Scalar: A Counterfactual Audit of English-Source Cue Inheritance in Chinese Multilingual Benchmarks

arXiv – CS AI | Zezheng Lin, Fengming Liu, Handi Li
🤖 AI Summary

Researchers challenge the assumption that the 'Translation Tax', the presumed uniform penalty that translation imposes on multilingual benchmarks, operates as a simple scalar. Through counterfactual analysis of English-to-Chinese translations, they find that translation quality effects are heterogeneous, model-dependent, and item-specific rather than uniform across benchmarks.

Analysis

This research challenges a foundational assumption in multilingual AI evaluation: that translation introduces consistent, predictable biases. The 'Translation Tax' concept presumes translated benchmarks systematically inflate performance scores by preserving English-language cues, treating this as a uniform penalty. The authors test this hypothesis across three estimation methods and find stark disagreements—back-translation gaps prove small and fragile, cue-score calibration fails to predict item-level performance gains, and native model comparisons reveal model-family effects rather than benchmark-wide penalties.
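To make the first of these estimators concrete, here is a minimal sketch, not the authors' code, of a per-item back-translation gap: score each item on its original English form and on its English→Chinese→English round-trip, then take the difference. The score_fn interface and the item field names are illustrative assumptions.

    # Minimal sketch (not the authors' code) of a per-item back-translation
    # gap. score_fn(prompt, gold) -> 1 if the model answers correctly, else 0;
    # each item dict carries the original English prompt, its EN->ZH->EN
    # round-trip, and the gold answer. Field names are illustrative.
    from statistics import mean

    def back_translation_gaps(score_fn, items):
        gaps = [
            score_fn(it["english"], it["answer"])
            - score_fn(it["back_translated"], it["answer"])
            for it in items
        ]
        # Per-item gaps in {-1, 0, +1}; the mean is the scalar estimate
        # whose smallness and fragility the paper highlights.
        return gaps, mean(gaps)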

The study's novel LLM-naturalization stress test isolates surface-form translation quality by holding semantic content constant while rewriting the Chinese phrasing. After correcting a methodological error in prompt construction, the data reveal not a single tax but a dose-response relationship tied to linguistic residue: high-residue items (those with stronger translation artifacts) show measurable performance differences, while clean translations do not.
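A dose-response check of this kind could be sketched as follows: bin items by a translation-residue score and compare accuracy on the translated versus the naturalized form within each bin. The bin edges and field names below are assumptions, not the paper's protocol.

    # Illustrative dose-response check: bin items by an assumed
    # translation-residue score in [0, 1] and compare accuracy on the
    # translated vs. naturalized form within each bin. Bin edges and
    # field names are assumptions, not the paper's protocol.
    from collections import defaultdict
    from statistics import mean

    def delta_by_residue_bin(items, edges=(0.0, 0.33, 0.66, 1.0)):
        bins = defaultdict(list)
        for it in items:
            for lo, hi in zip(edges, edges[1:]):
                if lo <= it["residue"] <= hi:
                    bins[(lo, hi)].append(
                        it["correct_translated"] - it["correct_naturalized"]
                    )
                    break
        # A delta that grows with residue, rather than a flat offset
        # across all bins, is the dose-response pattern instead of a
        # single scalar tax.
        return {b: mean(d) for b, d in sorted(bins.items())}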

This matters significantly for AI development and evaluation standards. Benchmark designers and researchers comparing multilingual models have likely applied oversimplified correction assumptions, potentially misattributing performance gaps to translation when model architecture or training data explains the variance. The heterogeneous findings suggest that blanket adjustment or discounting of translated benchmark results is methodologically unsound. For the AI research community, this demands more granular, item-level analysis rather than scalar corrections. The authors' release of per-cell evidence, naturalization protocols, and a reporting checklist provides practical tools for improving future multilingual benchmark construction and interpretation.
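As one example of what item-level, rather than scalar, analysis might look like, a McNemar-style flip count records how many items a model gets right on one form but not the other. This is a standard paired comparison, not necessarily the authors' exact procedure, and the field names are placeholders.

    # McNemar-style flip count for item-level comparison: how many items
    # are correct on the translated form but not the naturalized one, and
    # vice versa. Field names are illustrative placeholders.
    def flip_counts(items):
        only_translated = sum(
            1 for it in items
            if it["correct_translated"] and not it["correct_naturalized"]
        )
        only_naturalized = sum(
            1 for it in items
            if it["correct_naturalized"] and not it["correct_translated"]
        )
        # A strong asymmetry between these counts, concentrated on
        # high-residue items, is the kind of signal an item-level audit
        # surfaces and a mean score gap conceals.
        return only_translated, only_naturalized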

Key Takeaways
  • Translation quality effects on benchmark scores are heterogeneous and model-dependent, not a uniform scalar penalty
  • High-residue items with stronger translation artifacts show dose-response performance patterns while clean translations do not
  • Common estimation methods for measuring translation bias show contradictory results, indicating current approaches are unreliable
  • Model-family effects drive performance differences more than benchmark-level translation issues
  • Researchers should adopt item-level analysis and the authors' provided reporting checklist for multilingual benchmark evaluation