#benchmark-gap News & Analysis

4 articles tagged with #benchmark-gap. AI-curated summaries with sentiment analysis and key takeaways from 50+ sources.

AI · Bearish · arXiv – CS AI · Jun 8 · 7/10

Re-Centering Humans in LLM Personalization

Analyzing 550 human conversations across three personalization stages (attribute extraction, attribute selection, and response generation), researchers find a significant gap between synthetic and real-world performance in LLM personalization systems. Current models struggle with human-aligned personalization, and learned reward models fail to capture human preferences adequately, which points to fundamental limitations in how AI systems understand and incorporate user information.

AI · Bearish · arXiv – CS AI · May 7 · 7/10

Are Multimodal LLMs Ready for Clinical Dermatology? A Real-World Evaluation in Dermatology

A study evaluating five multimodal large language models (MLLMs) on real-world dermatology tasks finds a significant gap between benchmark performance and clinical applicability. Models reached up to 42% accuracy on public datasets but only 1.5-24.65% on actual hospital cases, a critical limitation for deploying these systems in clinical decision-making.

🧠 GPT-4

AI · Bearish · arXiv – CS AI · Apr 15 · 7/10

Mobile GUI Agents under Real-world Threats: Are We There Yet?

Researchers have identified critical vulnerabilities in mobile GUI agents powered by large language models: third-party content in real-world apps causes these agents to fail significantly more often than benchmark tests suggest. Across 122 dynamic tasks and over 3,000 static scenarios, the agents were misled at rates of 36-42%, raising serious concerns about deploying them in commercial settings.

AI · Bearish · arXiv – CS AI · Apr 14 · 7/10

VeriSim: A Configurable Framework for Evaluating Medical AI Under Realistic Patient Noise

Researchers introduce VeriSim, an open-source framework that tests medical AI systems by injecting realistic patient communication barriers, such as memory gaps and health literacy limitations, into clinical simulations. Tests across seven LLMs show significant performance degradation (a 15-25% accuracy drop), with smaller models suffering a 40% greater decline than larger ones. The results expose a critical gap between standardized benchmarks and real-world clinical robustness.