DEPART: DEcomposing PARiTy across Multilingual LLMs
Researchers introduce DEPART, a Bayesian framework that systematically decomposes performance disparities across multilingual large language models into interpretable components. The study reveals that language features and representational similarity to English explain 79-92% of variance, with model identity dominating NLU tasks while benchmark-model interactions drive reasoning task differences.
The multilingual AI research landscape has long had a transparency gap: performance leaderboards report per-language accuracy without explaining what drives the disparities. This study addresses that gap with statistical methodology that moves beyond surface-level benchmarking toward attributing the gaps to identifiable sources. Using distribution-free statistical tests and a hierarchical Bayesian decomposition, the researchers establish that language performance gaps are systematic rather than noise-driven, which is what makes decomposing them meaningful in the first place.
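To make "systematic rather than noise-driven" concrete, here is a minimal sketch of one distribution-free check, a permutation test. The summary does not say which tests the authors used, so this is an illustrative stand-in rather than their procedure. The accuracy table, the function names, and the choice of test statistic are all assumptions made for the example.

```python
import random
from statistics import mean, pvariance


def language_spread(scores):
    """Variance of per-language mean accuracy (rows: models, columns: languages)."""
    n_langs = len(scores[0])
    lang_means = [mean(row[j] for row in scores) for j in range(n_langs)]
    return pvariance(lang_means)


def permutation_test(scores, n_perm=2000, seed=0):
    """Shuffle language labels within each model to build a null distribution.

    If language had no systematic effect, reassigning each model's scores to
    random languages should produce a spread as large as the observed one
    reasonably often.
    """
    rng = random.Random(seed)
    observed = language_spread(scores)
    hits = 0
    for _ in range(n_perm):
        shuffled = [rng.sample(row, len(row)) for row in scores]
        if language_spread(shuffled) >= observed:
            hits += 1
    # Add-one correction keeps the estimated p-value strictly positive.
    return observed, (hits + 1) / (n_perm + 1)


# Hypothetical accuracies, NOT taken from the study.
scores = [
    [0.82, 0.74, 0.61, 0.55],
    [0.79, 0.70, 0.58, 0.50],
    [0.85, 0.77, 0.66, 0.57],
    [0.80, 0.73, 0.60, 0.52],
]
observed, p_value = permutation_test(scores)
print(f"spread={observed:.4f}, p={p_value:.4f}")
```

A small p-value here says only that the language ordering is consistent across models. It does not say why, which is the question the decomposition is meant to answer.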
The framework's key insight is that representational similarity to English emerges as the dominant predictor of multilingual performance, which points to a structural bias in current training paradigms. Most large language models are trained predominantly on English data, creating an implicit linguistic hierarchy. NLU and reasoning tasks differ in where their variance sits. In understanding tasks, model identity accounts for 66.7% of variance, suggesting that some models capture language nuance better than others. In reasoning tasks, the benchmark-model interaction is the largest component at 46.3%, indicating that task design significantly influences cross-language generalization.
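The idea of attributing variance to model identity versus a benchmark-model interaction can be illustrated with a classical two-way sum-of-squares split. This is a deliberately simplified, non-Bayesian sketch and not the hierarchical model DEPART uses. The table values and the name `variance_shares` are hypothetical.

```python
from statistics import mean


def variance_shares(table):
    """Split the total sum of squares of a models x benchmarks table.

    Returns the share attributable to the model main effect, the benchmark
    main effect, and the model-by-benchmark interaction. With one value per
    cell, the interaction term is whatever the two main effects leave over.
    Assumes the table is not constant (total sum of squares > 0).
    """
    n_rows, n_cols = len(table), len(table[0])
    grand = mean(v for row in table for v in row)
    row_eff = [mean(row) - grand for row in table]
    col_eff = [
        mean(table[i][j] for i in range(n_rows)) - grand for j in range(n_cols)
    ]
    ss_model = n_cols * sum(e * e for e in row_eff)
    ss_bench = n_rows * sum(e * e for e in col_eff)
    ss_inter = sum(
        (table[i][j] - grand - row_eff[i] - col_eff[j]) ** 2
        for i in range(n_rows)
        for j in range(n_cols)
    )
    total = ss_model + ss_bench + ss_inter
    return {
        "model": ss_model / total,
        "benchmark": ss_bench / total,
        "model_x_benchmark": ss_inter / total,
    }


# Hypothetical mean accuracies: rows are models, columns are benchmarks.
table = [
    [0.71, 0.52, 0.64],
    [0.66, 0.61, 0.49],
    [0.80, 0.58, 0.70],
]
for name, share in variance_shares(table).items():
    print(f"{name}: {share:.1%}")
```

A large `model` share means rankings are stable across benchmarks. A large `model_x_benchmark` share means which model looks best depends on which benchmark is used, the pattern the summary reports for reasoning tasks.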
For practitioners developing multilingual systems, this research provides actionable diagnostics. Rather than treating language disparities as inevitable, teams can identify whether a specific performance gap stems from representational distance, model limitations, or benchmark design. This shifts multilingual evaluation from descriptive reporting toward targeted intervention. That observable features explain 79-92% of variance suggests concrete optimization paths: refining tokenization for underrepresented scripts, adjusting training data composition by language family, or redesigning benchmarks to reduce benchmark-model interaction effects. Researchers and AI developers should expect increasing pressure to explain, rather than merely report, multilingual performance variations.
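As a rough illustration of what "variance explained by an observable feature" means, the sketch below fits a one-variable least-squares line of per-language accuracy against a similarity-to-English score and reports R². The study's actual model is hierarchical and uses several features, so this shows only the simplest version of the idea. The data points and the name `fit_line` are invented for the example.

```python
def fit_line(x, y):
    """Ordinary least squares for y = intercept + slope * x; returns R^2 too."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    syy = sum((b - mean_y) ** 2 for b in y)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    ss_res = sum((b - (intercept + slope * a)) ** 2 for a, b in zip(x, y))
    # R^2 is the fraction of variance in y that the fitted line accounts for.
    return slope, intercept, 1 - ss_res / syy


# Hypothetical per-language values, NOT taken from the study.
similarity_to_english = [0.92, 0.81, 0.74, 0.60, 0.48, 0.35]
accuracy = [0.84, 0.80, 0.71, 0.66, 0.52, 0.49]

slope, intercept, r2 = fit_line(similarity_to_english, accuracy)
print(f"slope={slope:.3f}, intercept={intercept:.3f}, R^2={r2:.3f}")
```

Languages that sit far below the fitted line are candidates for causes other than representational distance, such as tokenization or benchmark translation quality.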
- English representational similarity is the dominant predictor of multilingual performance across both NLU and reasoning tasks
- Observable language features (script, family, typological distance) explain 79% of variance in understanding tasks and 92% in reasoning
- Model identity drives 66.7% of variance in NLU tasks while benchmark-model interactions dominate reasoning performance at 46.3%
- Statistical tests confirm that multilingual performance disparities are systematic, not sampling artifacts
- The framework provides practitioners concrete diagnostic levers to target root causes of language-specific performance gaps