y0news
AnalyticsDigestsSourcesTopicsRSSAICrypto

#benchmark-gap News & Analysis

3 articles tagged with #benchmark-gap. AI-curated summaries with sentiment analysis and key takeaways from 50+ sources.

3 articles
AIBearisharXiv – CS AI · May 77/10
🧠

Are Multimodal LLMs Ready for Clinical Dermatology? A Real-World Evaluation in Dermatology

A comprehensive study evaluating five multimodal large language models (MLLMs) on real-world dermatology tasks reveals a significant gap between benchmark performance and clinical applicability. While models achieved up to 42% accuracy on public datasets, performance dropped dramatically to 1.5-24.65% on actual hospital cases, highlighting critical limitations in deploying these systems for clinical decision-making.

🧠 GPT-4
AIBearisharXiv – CS AI · Apr 157/10
🧠

Mobile GUI Agents under Real-world Threats: Are We There Yet?

Researchers have identified critical vulnerabilities in mobile GUI agents powered by large language models, revealing that third-party content in real-world apps causes these agents to fail significantly more often than benchmark tests suggest. Testing on 122 dynamic tasks and over 3,000 static scenarios shows misleading rates of 36-42%, raising serious concerns about deploying these agents in commercial settings.

AIBearisharXiv – CS AI · Apr 147/10
🧠

VeriSim: A Configurable Framework for Evaluating Medical AI Under Realistic Patient Noise

Researchers introduce VeriSim, an open-source framework that tests medical AI systems by injecting realistic patient communication barriers—such as memory gaps and health literacy limitations—into clinical simulations. Testing across seven LLMs reveals significant performance degradation (15-25% accuracy drop), with smaller models suffering 40% greater decline than larger ones, exposing a critical gap between standardized benchmarks and real-world clinical robustness.