🧠 AI | ⚪ Neutral | Importance 6/10

Sci-Rho: A Multilingual Visually-Grounded Symbolic Benchmark for STEM Problems

arXiv – CS AI | Muhammad Falensi Azmi, Ikhlasul Akmal Hanif, Vallerie Alexandra Putra, Adi Yeltay, Abdullah Mubarak, Fajri Koto | June 9, 2026 at 04:00 AM

🤖 AI Summary

Researchers introduce Sci-Rho, a multilingual benchmark of 42,420 visually grounded STEM problem instances in seven languages, designed to test the robustness of vision-language models (VLMs). The study finds a significant gap between average and worst-case accuracy: smaller models degrade more across languages, while larger proprietary models are more robust.

Analysis

Sci-Rho targets a gap in AI evaluation: static benchmarks report an average score and say little about how consistently a model performs. Existing STEM benchmarks have focused primarily on mathematical reasoning in English without visual grounding, which limits how well they assess real-world reliability. Sci-Rho instead builds on 4,242 expert-crafted problem templates whose numerical values, visual patterns, and geometric elements are varied programmatically, so that evaluation measures consistency across variants of the same problem rather than average performance alone.
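The idea of a problem template with programmatic variation can be sketched as follows. This is a minimal illustration, not the benchmark's actual generator: the triangle problem, the value ranges, and the names `Instance`, `make_instance`, and `generate_instances` are all hypothetical, and the sketch varies only numerical values, whereas Sci-Rho also varies visual patterns and geometric elements.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Instance:
    """One concrete problem produced from a template (hypothetical structure)."""
    question: str
    answer: float


def make_instance(rng: random.Random) -> Instance:
    """Fill a hypothetical template with freshly sampled numerical values."""
    base = rng.randint(2, 20)
    height = rng.randint(2, 20)
    question = (
        f"A right triangle has legs of {base} cm and {height} cm. "
        "What is its area in square centimetres?"
    )
    # The ground-truth answer is computed from the sampled values,
    # so every variant stays correct by construction.
    return Instance(question=question, answer=0.5 * base * height)


def generate_instances(n: int, seed: int = 0) -> list[Instance]:
    """Generate n variants of the same template, reproducibly for a given seed."""
    rng = random.Random(seed)
    return [make_instance(rng) for _ in range(n)]


if __name__ == "__main__":
    for inst in generate_instances(3, seed=1):
        print(inst.question, "->", inst.answer)
```

Seeding the generator keeps the variants reproducible, which matters when the same instances must be shown to several models for comparison.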

The benchmark's multilingual scope reflects growing recognition that model robustness varies substantially across linguistic contexts. The evaluation of 17 state-of-the-art VLMs shows a clear pattern: proprietary and larger models stay largely consistent across problem variations, while smaller models degrade markedly. The resulting gap between worst-case and average accuracy matters for practical deployment, where models must perform reliably on unseen variations rather than only on the cases a fixed test set happens to contain.
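The difference between the two metrics can be made concrete with a short sketch. The summary does not give the paper's exact definition of worst-case accuracy, so the code assumes a common one: a template counts as solved only if every one of its variants is answered correctly. The function name and the input format are hypothetical.

```python
from collections import defaultdict


def average_and_worst_case(results):
    """Compute (average accuracy, worst-case accuracy).

    results: iterable of (template_id, is_correct) pairs, one per instance.
    Assumption: worst-case accuracy is the fraction of templates whose
    variants were ALL answered correctly.
    """
    by_template = defaultdict(list)
    for template_id, is_correct in results:
        by_template[template_id].append(bool(is_correct))
    if not by_template:
        raise ValueError("results must contain at least one instance")

    total = sum(len(outcomes) for outcomes in by_template.values())
    correct = sum(sum(outcomes) for outcomes in by_template.values())
    solved_templates = sum(all(outcomes) for outcomes in by_template.values())

    return correct / total, solved_templates / len(by_template)


# Two templates with two variants each: one variant of t2 is missed.
example = [("t1", True), ("t1", True), ("t2", True), ("t2", False)]
print(average_and_worst_case(example))  # (0.75, 0.5)
```

In this toy case a single wrong variant lowers average accuracy only to 0.75 but halves worst-case accuracy, which is the kind of gap the benchmark is designed to expose.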

The attention head analysis shows cross-lingual variation in how VLMs distribute attention between image and text tokens, suggesting that models process the same visual problem differently depending on the language. This offers a plausible explanation for the performance disparities and points to architectural limitations rather than only insufficient training data. For developers and researchers, Sci-Rho provides an evaluation framework intended to predict real-world performance better than traditional benchmarks. The work indicates that current evaluation practices mask substantial brittleness in widely deployed models, particularly for non-English and resource-constrained applications.
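One simple way to quantify how a head splits its attention between modalities is the share of attention mass that lands on image tokens. The sketch below is an assumption about what such a measurement could look like, not the paper's analysis method; the function name and the plain-list inputs are hypothetical stand-ins for weights extracted from a real model.

```python
def image_attention_share(attn_weights, is_image_token):
    """Fraction of one head's attention mass that falls on image tokens.

    attn_weights: non-negative attention weights over the key tokens
                  for a single query position of a single head.
    is_image_token: booleans of the same length, True for image tokens.
    """
    if len(attn_weights) != len(is_image_token):
        raise ValueError("weights and mask must have the same length")
    total = sum(attn_weights)
    if total <= 0:
        raise ValueError("attention weights must sum to a positive value")
    image_mass = sum(w for w, is_img in zip(attn_weights, is_image_token) if is_img)
    return image_mass / total


# One image token followed by two text tokens.
print(image_attention_share([0.25, 0.25, 0.5], [True, False, False]))  # 0.25
```

Computing this share for the same problem rendered in different languages, and comparing the values per head, would be one way to surface the kind of cross-lingual variation described above.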

Key Takeaways

→ Sci-Rho introduces 42,420 dynamically generated STEM problem instances across seven languages to measure model robustness beyond average performance metrics.
→ A significant gap exists between worst-case and average accuracy in state-of-the-art VLMs, with smaller models showing greater performance degradation across variations.
→ Larger proprietary models demonstrate better cross-lingual robustness than smaller models, indicating scaling benefits for multilingual reliability.
→ Attention head analysis reveals substantial cross-lingual variation in how vision-language models distribute attention between image and text tokens.
→ Current static benchmarks mask model brittleness, so dynamic evaluation frameworks are needed to assess readiness for practical deployment.