
VERT: Reliable LLM Judges for Radiology Report Evaluation

arXiv – CS AI | Federica Bologna, Jean-Philippe Corbeil, Matthew Wilkens, Asma Ben Abacha | April 7, 2026 at 04:00 AM

🤖 AI Summary

Researchers introduced VERT, an LLM-based metric for evaluating radiology reports that correlates up to 11.7% better with radiologist judgments than existing methods. The study also shows that a fine-tuned smaller model can achieve significant performance gains while cutting inference time by a factor of up to 37.2.

Key Takeaways

→VERT outperforms existing LLM-based radiology evaluation metrics such as RadFact, GREEN, and FineRadScore by up to 11.7% in correlation with radiologist judgments.
→Fine-tuning Qwen3 30B with only 1,300 training samples achieved up to 25% performance gains.
→The fine-tuned model cut inference time by a factor of up to 37.2 compared to larger models.
→The research validates LLM-based judges across multiple radiology modalities and anatomies beyond chest X-rays.
→Lightweight model adaptation can achieve reliable radiology report evaluation without requiring massive computational resources.
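
The headline result is stated as correlation with radiologist judgments, so a small sketch of how such agreement might be measured can help make the claim concrete. The summary does not say which correlation coefficient the authors use; the example below assumes Kendall's tau (the simple tau-a variant, with no tie correction) purely for illustration. The function name `kendall_tau`, the two "judges", and all scores are hypothetical and are not data from the paper.

```python
from itertools import combinations


def kendall_tau(xs, ys):
    """Kendall's tau-a: (concordant - discordant) / number of pairs.

    Assumption: this is an illustrative choice of coefficient, not
    necessarily the one used in the VERT paper.
    """
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    concordant = discordant = 0
    for (x1, y1), (x2, y2) in combinations(zip(xs, ys), 2):
        sign = (x1 - x2) * (y1 - y2)
        if sign > 0:
            concordant += 1      # both rankings order this pair the same way
        elif sign < 0:
            discordant += 1      # the rankings disagree on this pair
        # sign == 0 means a tie; tau-a counts it as neither
    total_pairs = len(xs) * (len(xs) - 1) // 2
    return (concordant - discordant) / total_pairs


# Made-up illustrative values, NOT results from the paper.
radiologist_scores = [1, 2, 3, 4, 5]           # hypothetical human quality ratings
judge_a_scores = [0.2, 0.1, 0.5, 0.7, 0.9]     # hypothetical automated metric A
judge_b_scores = [0.3, 0.6, 0.2, 0.4, 0.8]     # hypothetical automated metric B

tau_a = kendall_tau(radiologist_scores, judge_a_scores)
tau_b = kendall_tau(radiologist_scores, judge_b_scores)
print(f"judge A tau = {tau_a:.2f}, judge B tau = {tau_b:.2f}")
```

In this toy setup, the metric with the higher tau ranks reports more like the radiologist does, which is the sense in which one judge can be said to correlate better with human judgment than another.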