y0news

#evaluation-metrics News & Analysis

30 articles tagged with #evaluation-metrics. AI-curated summaries with sentiment analysis and key takeaways from 50+ sources.

AI · Neutral · arXiv – CS AI · Mar 3 · 6/10 · 7

MC-Search: Evaluating and Enhancing Multimodal Agentic Search with Structured Long Reasoning Chains

Researchers introduce MC-Search, the first benchmark for evaluating agentic multimodal retrieval-augmented generation (MM-RAG) systems with long, structured reasoning chains. The benchmark reveals systematic issues in current multimodal large language models and introduces Search-Align, a training framework that improves planning and retrieval accuracy.

AI · Neutral · arXiv – CS AI · Mar 3 · 6/10 · 7

Benchmarking LLM Summaries of Multimodal Clinical Time Series for Remote Monitoring

Researchers developed an event-based evaluation framework for LLM-generated clinical summaries of remote monitoring data, revealing that models with high semantic similarity often fail to capture clinically significant events. A vision-based approach using time-series visualizations achieved the best clinical event alignment with 45.7% abnormality recall.

AI · Neutral · arXiv – CS AI · Feb 27 · 6/10 · 7

SPM-Bench: Benchmarking Large Language Models for Scanning Probe Microscopy

Researchers have developed SPM-Bench, a PhD-level benchmark for testing large language models on scanning probe microscopy tasks. The benchmark uses automated data synthesis from scientific papers and introduces new evaluation metrics to assess AI reasoning capabilities in specialized scientific domains.

AI · Neutral · arXiv – CS AI · Mar 9 · 4/10

Conditioning LLMs to Generate Code-Switched Text

Researchers developed a methodology to fine-tune large language models (LLMs) for generating code-switched text between English and Spanish by back-translating natural code-switched sentences into monolingual English. The study found that fine-tuning significantly improves LLMs' ability to generate fluent code-switched text, and that LLM-based evaluation methods align better with human preferences than traditional metrics.
