🧠 AI | ⚪ Neutral | Importance 6/10

Beyond English Benchmarks: Clinical LLM Evaluation in Brazilian Portuguese

arXiv – CS AI|Giordano de Pinho Souza, Glaucia Melo, Josefino Cabral Melo Lima, Daniel Schneider|June 9, 2026 at 04:00 AM

🤖 AI Summary

Researchers introduce ClinicalBr, the first bilingual clinical benchmark using 2,892 real Brazilian Portuguese-English case reports to evaluate large language models. The study reveals that English-language advantages in clinical AI are task-dependent, with Portuguese performing comparably in differential diagnosis, exam recommendations, and treatment planning.

Analysis

The release of ClinicalBr addresses a critical gap in AI model evaluation: the overwhelming bias toward English-language benchmarks in clinical decision support. The research shows that performance disparities between languages are not uniform across medical tasks, which challenges the assumption of a universal language gap in specialized domains. In diagnosis retrieval, English holds an accuracy advantage of 7.5 to 12.1 points. That advantage vanishes in the three other clinical tasks (differential diagnosis, exam recommendation, and treatment planning), where Portuguese completeness scores are marginally higher. This suggests that model pre-training represents non-English medical knowledge adequately, at least for Portuguese.

The finding that models handle Brazilian-endemic tropical diseases more easily than the general corpus contradicts the assumption that they struggle with regional health challenges. It suggests that pre-training data includes sufficient representation of non-Western medical presentations. However, exam recommendation emerges as a significant weakness across all tested models (MedGemma-27B, Sabiá-4, DeepSeek-R1, and o3-mini), with F1 scores below 0.10. Clinical AI therefore still has substantial limitations in practical recommendation tasks, regardless of language.
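To make the F1 figure concrete, here is a minimal sketch of a set-based F1 score for exam recommendation. It assumes exams are compared as normalized strings with exact matching; the benchmark's actual matching procedure is not described in this summary, and the function name `exam_f1` and the example exam lists are hypothetical.

```python
def exam_f1(predicted, reference):
    """Set-based F1 between predicted and reference exam lists.

    Assumption: exams match only if their lowercased, stripped names are equal.
    """
    pred = {p.strip().lower() for p in predicted}
    ref = {r.strip().lower() for r in reference}
    if not pred or not ref:
        return 0.0
    true_positives = len(pred & ref)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(pred)  # share of predictions that are correct
    recall = true_positives / len(ref)      # share of reference exams recovered
    return 2 * precision * recall / (precision + recall)


# Hypothetical case: 1 of 4 predicted exams matches 1 of 5 reference exams.
predicted = ["Chest X-ray", "CBC", "ECG", "Urinalysis"]
reference = ["chest x-ray", "blood culture", "CT scan", "serology", "liver panel"]
score = exam_f1(predicted, reference)  # precision 0.25, recall 0.20
```

Under exact matching, a model that lists plausible but differently worded exams gets little credit, which is one way very low F1 values can arise.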

For the AI development community, ClinicalBr provides a crucial resource for building more equitable clinical systems across linguistic boundaries. The parallel corpus structure enables researchers to isolate language-specific performance gaps from task-specific limitations. This work signals that global healthcare AI deployment should prioritize task-specific evaluation frameworks rather than assuming uniform language-based deficiencies. Development teams building clinical systems for Portuguese-speaking regions can now rely on localized benchmarks to assess real-world performance.
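Separating language effects from task effects can be sketched as a per-task comparison over paired scores from a parallel corpus. The record layout, the task names, and all numbers below are made up for illustration; they are not results from the paper, and `language_gap_by_task` is a hypothetical helper.

```python
from collections import defaultdict
from statistics import mean


def language_gap_by_task(records):
    """Return mean English score minus mean Portuguese score for each task.

    Each record holds scores for the same case in both languages, so a gap
    near zero for a task points to a task-level limitation, not a language one.
    """
    by_task = defaultdict(lambda: {"en": [], "pt": []})
    for rec in records:
        by_task[rec["task"]]["en"].append(rec["en"])
        by_task[rec["task"]]["pt"].append(rec["pt"])
    return {task: mean(s["en"]) - mean(s["pt"]) for task, s in by_task.items()}


# Illustrative paired scores (invented values).
records = [
    {"task": "diagnosis_retrieval", "en": 0.80, "pt": 0.70},
    {"task": "diagnosis_retrieval", "en": 0.60, "pt": 0.50},
    {"task": "treatment_planning", "en": 0.40, "pt": 0.42},
    {"task": "treatment_planning", "en": 0.50, "pt": 0.50},
]
gaps = language_gap_by_task(records)
```

A positive gap favors English and a negative gap favors Portuguese. Because every case appears in both languages, differences in case mix cannot explain the gap.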

Key Takeaways

→ The English-language advantage in clinical AI is task-dependent: it appears in diagnosis retrieval but disappears in differential diagnosis, exam recommendation, and treatment planning
→ ClinicalBr provides the first bilingual clinical benchmark, with 2,892 real cases from 28 Brazilian medical journals across 18 specialties
→ All tested models struggle with exam recommendation, achieving F1 scores below 0.10 regardless of language
→ Brazilian-endemic tropical conditions appear adequately represented in current model pre-training, contrary to expectations
→ Bilingual evaluation frameworks reveal that performance gaps previously attributed to language may actually reflect task-specific limitations