Generalistic or Specific Embeddings, Which is Better? An Empirical Study on Search for Clinical Coding in Non-English Languages
Researchers demonstrate that fine-tuning Spanish biomedical embeddings with synthetic data generated by large language models significantly improves clinical code retrieval across multiple European languages. The two-stage retrieval system outperforms existing baselines such as BioBERT-ST, particularly for non-English languages, addressing a critical gap in multilingual medical AI applications.
Clinical coding systems like ICD-10-CM represent critical infrastructure for healthcare documentation and billing worldwide, yet most semantic search models are optimized exclusively for English. This study reveals a systematic performance degradation when applying English-centric embeddings to non-English clinical retrieval tasks, a problem that has remained largely hidden within aggregate benchmarks that mask language-specific failures.
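To make the masking effect concrete, here is a minimal sketch of how recall at k (R@k) can be broken down per language. It assumes one gold code per query and counts a hit when that code appears in the top k results; the study's exact metric definition is not given here, and the queries below are made-up toy data, not results from the paper.

```python
from collections import defaultdict


def recall_at_k(ranked_codes, gold_code, k=5):
    """Return 1.0 if the gold code is among the top-k retrieved codes, else 0.0."""
    return 1.0 if gold_code in ranked_codes[:k] else 0.0


def recall_by_language(examples, k=5):
    """Compute overall and per-language mean recall@k.

    Each example is a tuple: (language, ranked_codes, gold_code).
    """
    per_lang = defaultdict(list)
    for lang, ranked, gold in examples:
        per_lang[lang].append(recall_at_k(ranked, gold, k))

    by_lang = {lang: sum(hits) / len(hits) for lang, hits in per_lang.items()}
    total = sum(len(hits) for hits in per_lang.values())
    overall = sum(sum(hits) for hits in per_lang.values()) / total
    return overall, by_lang


# Toy, hypothetical retrieval outputs (illustration only).
examples = [
    ("en", ["E11.9", "E10.9"], "E11.9"),
    ("en", ["I10", "I11.9"], "I10"),
    ("en", ["J45.909", "J44.9"], "J45.909"),
    ("pt", ["I10", "I11.9"], "E11.9"),  # miss: gold code not retrieved
]

overall, by_lang = recall_by_language(examples, k=5)
print(overall)  # aggregate score
print(by_lang)  # per-language scores
```

In this toy set the aggregate looks acceptable because English queries dominate, while the single Portuguese query fails outright. Reporting `by_lang` alongside `overall` is what exposes that kind of gap.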
The researchers' approach uses Gemini to generate synthetic training pairs across six languages, then fine-tunes a Spanish biomedical encoder into a specialized retriever. This addresses a fundamental scarcity: high-quality, annotated clinical datasets in non-English languages remain expensive and difficult to obtain. The strategy proves effective. The fine-tuned bi-encoder matches or exceeds the BioBERT-ST baseline while substantially improving non-English recall: Portuguese retrieval improves from 0.714 to 0.829 at R@5, a relative gain of about 16% (0.115 absolute).
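The 16% figure is a relative gain, not an absolute difference in recall. A short check of the arithmetic, using only the two Portuguese R@5 values reported above (the function name is illustrative):

```python
def relative_gain(before, after):
    """Relative improvement of `after` over `before`, as a fraction."""
    return (after - before) / before


baseline_r5 = 0.714   # Portuguese R@5 reported for the baseline
finetuned_r5 = 0.829  # Portuguese R@5 reported after fine-tuning

absolute = finetuned_r5 - baseline_r5
relative = relative_gain(baseline_r5, finetuned_r5)

print(f"absolute: {absolute:.3f}, relative: {relative:.1%}")
# prints "absolute: 0.115, relative: 16.1%"
```

The absolute improvement is about 11.5 recall points; dividing by the baseline gives the roughly 16% relative gain.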
For healthcare organizations in Spanish-speaking, Portuguese-speaking, and French-speaking regions, this work offers immediate practical value. The demonstrated recipe for building domain-specific retrievers from LLM-generated data lowers the barrier to deployment, enabling smaller healthcare systems to build effective multilingual clinical search without massive proprietary datasets. The cross-encoder reranking stage introduces a deliberate trade-off: it sacrifices marginal English performance to substantially improve other languages, a clinically reasonable choice given the broader user populations served.
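The two-stage pattern described above (a bi-encoder that retrieves candidates, followed by a cross-encoder that reranks them) can be sketched as follows. This is a structural illustration only: `embed` and `pair_score` are hypothetical stand-ins built from word counts and token overlap, not the fine-tuned models from the study, and the code descriptions are simplified.

```python
import math
from collections import Counter


def embed(text):
    """Stand-in for a bi-encoder: a bag-of-words count vector (not a real model)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def pair_score(query, doc):
    """Stand-in for a cross-encoder: scores the (query, doc) pair jointly.

    Uses Jaccard token overlap purely for illustration.
    """
    q, d = set(query.lower().split()), set(doc.lower().split())
    union = q | d
    return len(q & d) / len(union) if union else 0.0


def two_stage_search(query, code_descriptions, n_candidates=3, k=2):
    """Retrieve n_candidates with the bi-encoder, then rerank and keep the top k."""
    # Stage 1: documents are embedded independently of the query, so their
    # vectors could be precomputed and indexed in a real system.
    q_vec = embed(query)
    candidates = sorted(
        code_descriptions,
        key=lambda code: cosine(q_vec, embed(code_descriptions[code])),
        reverse=True,
    )[:n_candidates]

    # Stage 2: the more expensive pairwise scorer only sees the short list.
    reranked = sorted(
        candidates,
        key=lambda code: pair_score(query, code_descriptions[code]),
        reverse=True,
    )
    return reranked[:k]


# Simplified ICD-10-CM style descriptions, for illustration.
codes = {
    "E11.9": "type 2 diabetes mellitus without complications",
    "E10.9": "type 1 diabetes mellitus without complications",
    "I10": "essential primary hypertension",
    "J45.909": "unspecified asthma uncomplicated",
}

results = two_stage_search("type 2 diabetes", codes)
print(results)
```

The design point is the cost split: the first stage is cheap enough to run over the full code set, while the reranker is applied only to a handful of candidates. Because the second stage reorders results, it can help some languages and slightly hurt others, which is the trade-off noted above.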
The research hints at broader trends in AI infrastructure: the maturation of synthetic data generation, the viability of language-specific fine-tuning over universal models, and the emerging economics where generating domain-specific data becomes cheaper than collecting human-annotated examples. This pattern likely extends beyond clinical coding to other specialized multilingual domains.
- Fine-tuned Spanish biomedical embeddings with synthetic LLM data outperform English-centric BioBERT-ST on non-English clinical code retrieval by up to 16%.
- A two-stage retriever combining bi-encoder and cross-encoder reranker achieves 0.822 R@5 overall with language-specific optimization trade-offs.
- Synthetic data generation via large language models enables cost-effective creation of domain-specific training data for low-resource languages.
- Portuguese retrieval improved dramatically from 0.714 to 0.829 R@5, demonstrating substantial gains for underserved medical AI applications.
- The open recipe provided enables healthcare organizations to build specialized multilingual medical retrievers without proprietary large-scale annotated datasets.