AINeutral · arXiv – CS AI · 10h ago · 7/10
🧠
MedMeta: A Benchmark for LLMs in Synthesizing Meta-Analysis Conclusions from Medical Studies
Researchers introduce MedMeta, a benchmark evaluating how well large language models can synthesize the conclusions of medical meta-analyses using only the abstracts of the underlying studies. The study finds that retrieval-augmented generation (RAG) significantly outperforms parametric-only approaches, yet all current models struggle with evidence synthesis and fail to properly reject contradictory findings, achieving only marginally above-average performance even under ideal conditions.