y0news

#llm-benchmark News & Analysis

4 articles tagged with #llm-benchmark. AI-curated summaries with sentiment analysis and key takeaways from 50+ sources.

AI · Bearish · arXiv – CS AI · Mar 27 · 7/10

A Decade-Scale Benchmark Evaluating LLMs' Clinical Practice Guidelines Detection and Adherence in Multi-turn Conversations

Researchers introduced CPGBench, a benchmark evaluating how well Large Language Models detect and follow clinical practice guidelines in healthcare conversations. The study found that while LLMs can detect 71-90% of clinical recommendations, they only adhere to guidelines 22-63% of the time, revealing significant gaps for safe medical deployment.

AI · Neutral · arXiv – CS AI · Mar 4 · 6/10 · 3

Engineering Reasoning and Instruction (ERI) Benchmark: A Large Taxonomy-driven Dataset for Foundation Models and Agents

Researchers released the ERI benchmark, a comprehensive dataset spanning 9 engineering fields and 55 subdomains to evaluate large language models' engineering capabilities. The benchmark tested 7 LLMs across 57,750 records, revealing a clear three-tier performance structure with frontier models like GPT-5 and Claude Sonnet 4 significantly outperforming mid-tier and smaller models.

AI · Bullish · Hugging Face Blog · Aug 1 · 6/10 · 7

📚 3LM: A Benchmark for Arabic LLMs in STEM and Code

3LM is a new benchmark designed to evaluate Arabic Large Language Models (LLMs) on STEM subjects and coding tasks. It addresses the lack of Arabic-language evaluation tools for technical domains by providing a standardized way to assess model performance in Arabic scientific and programming contexts.

AI · Bullish · Google Research Blog · Sep 24 · 5/10 · 4

AfriMed-QA: Benchmarking large language models for global health

AfriMed-QA is a new benchmark for evaluating large language models' performance in global health contexts, focusing specifically on African healthcare scenarios. The research addresses the need for culturally relevant AI assessment tools in medical applications for underrepresented regions.