#llm-benchmark News & Analysis

9 articles tagged with #llm-benchmark. AI-curated summaries with sentiment analysis and key takeaways from 50+ sources.

AI · Bearish · arXiv – CS AI · Mar 27 · 7/10

A Decade-Scale Benchmark Evaluating LLMs' Clinical Practice Guidelines Detection and Adherence in Multi-turn Conversations

Researchers introduced CPGBench, a benchmark evaluating how well Large Language Models detect and follow clinical practice guidelines in healthcare conversations. The study found that while LLMs can detect 71-90% of clinical recommendations, they adhere to guidelines only 22-63% of the time, revealing a significant gap between recognizing guidelines and following them that stands in the way of safe medical deployment.

AI · Neutral · arXiv – CS AI · Mar 4 · 6/10 · 3

Engineering Reasoning and Instruction (ERI) Benchmark: A Large Taxonomy-driven Dataset for Foundation Models and Agents

Researchers released the ERI benchmark, a comprehensive dataset spanning 9 engineering fields and 55 subdomains to evaluate large language models' engineering capabilities. The benchmark tested 7 LLMs across 57,750 records, revealing a clear three-tier performance structure with frontier models like GPT-5 and Claude Sonnet 4 significantly outperforming mid-tier and smaller models.

AI · Bearish · arXiv – CS AI · Jun 19 · 6/10

BIM-Edit: Benchmarking Large Language Models for IFC-Based Building Information Modeling

Researchers introduce BIM-Edit, a benchmark that evaluates large language models on their ability to edit existing Building Information Models in IFC format based on natural language instructions. The benchmark reveals significant capability gaps, with the best-performing LLM achieving only 49.5% accuracy and none solving more than 3.4% of tasks, highlighting that current AI systems struggle with the semantic preservation and relational understanding required for professional engineering workflows.

AI · Neutral · arXiv – CS AI · Jun 2 · 6/10

RoleCDE: Benchmarking and Mitigating Role-Alignment Trade-offs in Role-Playing Agents

Researchers introduce RoleCDE, a benchmark for evaluating role-playing agents built on large language models. It reveals a 'Role Value Decoupling' phenomenon in which LLMs default to alignment-oriented decisions over role-specific values when the two conflict. Fine-tuning with RoleCDE data effectively mitigates this behavior while preserving general performance.

AI · Neutral · arXiv – CS AI · Jun 2 · 6/10

Truth, Trust, and Trouble: Medical AI on the Edge

Researchers benchmarked open-source LLMs for medical question-answering, evaluating AlpaCare-13B, BioMistral-7B-DARE, and Mistral-7B across accuracy, safety, and helpfulness metrics. Results reveal fundamental trade-offs between factual reliability and harm prevention in medical AI systems, with implications for deploying these models in clinical settings.

AI · Neutral · arXiv – CS AI · May 28 · 6/10

AssertLLM2: A Comprehensive LLM Benchmark for Assertion Generation from Design Specifications

Researchers introduce AssertLLM2, an open-source benchmark containing 83 real-world hardware designs to evaluate how well Large Language Models can automatically generate formal SystemVerilog Assertions from specifications. The benchmark uniquely incorporates buggy RTL variants to assess both bug prevention and bug detection capabilities, establishing more rigorous evaluation standards for LLM-assisted hardware verification.

AI · Bullish · arXiv – CS AI · May 28 · 6/10

Teaching and Evaluating LLMs to Reason About Polymer Design Related Tasks

Researchers introduce PolyBench, a benchmark dataset containing 125K+ polymer design tasks backed by 13M data points, along with a knowledge-augmented reasoning method to improve LLM performance in materials science. Small and mid-sized language models trained on PolyBench achieve results competitive with frontier models, demonstrating practical progress in AI4Science applications.

AI · Bullish · Hugging Face Blog · Aug 1 · 6/10 · 7

📚 3LM: A Benchmark for Arabic LLMs in STEM and Code

3LM is a new benchmark designed to evaluate Arabic Large Language Models (LLMs) on STEM subjects and coding tasks. It addresses the lack of Arabic-language evaluation tools for technical domains, providing a standardized way to assess model performance in Arabic scientific and programming contexts.

AI · Bullish · Google Research Blog · Sep 24 · 5/10 · 4

AfriMed-QA: Benchmarking large language models for global health

AfriMed-QA is a new benchmark for evaluating large language models' performance in global health contexts, focusing specifically on African healthcare scenarios. The research addresses the need for culturally relevant AI assessment tools in medical applications for underrepresented regions.