y0news

#evaluation-metrics News & Analysis

11 articles tagged with #evaluation-metrics. AI-curated summaries with sentiment analysis and key takeaways from 50+ sources.

AI · Neutral · arXiv – CS AI · Mar 5 · 7/10

Escaping the BLEU Trap: A Signal-Grounded Framework with Decoupled Semantic Guidance for EEG-to-Text Decoding

Researchers propose SemKey, a novel framework that addresses key limitations in EEG-to-text decoding by preventing hallucinations and improving semantic fidelity through decoupled guidance objectives. The system redesigns neural encoder-LLM interaction and introduces new evaluation metrics beyond BLEU scores to achieve state-of-the-art performance in brain-computer interfaces.

AI · Neutral · arXiv – CS AI · Mar 5 · 7/10

InEdit-Bench: Benchmarking Intermediate Logical Pathways for Intelligent Image Editing Models

Researchers introduced InEdit-Bench, the first evaluation benchmark specifically designed to test image editing models' ability to reason through intermediate logical pathways in multi-step visual transformations. Testing 14 representative models revealed significant shortcomings in handling complex scenarios requiring dynamic reasoning and procedural understanding.

AI · Bullish · arXiv – CS AI · 4d ago · 6/10

Interactive ASR: Towards Human-Like Interaction and Semantic Coherence Evaluation for Agentic Speech Recognition

Researchers propose Interactive ASR, a new framework that combines semantic-aware evaluation using LLM-as-a-Judge with multi-turn interactive correction to improve automatic speech recognition beyond traditional word error rate metrics. The approach simulates human-like interaction, enabling iterative refinement of recognition outputs across English, Chinese, and code-switching datasets.
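For context on the baseline metric mentioned above, here is a minimal sketch of word error rate (WER), computed as the word-level edit distance divided by the number of reference words. The function name and the whitespace tokenization are assumptions made for illustration; this is not the paper's evaluation code.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.split()  # assumption: simple whitespace tokenization
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,                      # deletion
                curr[j - 1] + 1,                  # insertion
                prev[j - 1] + substitution_cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


# One substituted word out of four, even though the meaning is reversed.
print(word_error_rate("turn the lights off", "turn the lights on"))
```

WER counts every word error equally, so a single substitution that reverses the meaning costs no more than a harmless one. That is the kind of gap a semantic-aware evaluation is meant to address.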

AI · Bullish · arXiv – CS AI · Mar 16 · 6/10

AI Planning Framework for LLM-Based Web Agents

Researchers introduce a formal planning framework that maps LLM-based web agents to traditional search algorithms, enabling better diagnosis of failures in autonomous web tasks. The study compares different agent architectures using novel evaluation metrics and a dataset of 794 human-labeled trajectories from the WebArena benchmark.

AI · Neutral · arXiv – CS AI · Mar 6 · 6/10

Simulating Meaning, Nevermore! Introducing ICR: A Semiotic-Hermeneutic Metric for Evaluating Meaning in LLM Text Summaries

Researchers introduce ICR (Inductive Conceptual Rating), a new qualitative metric for evaluating meaning in large language model text summaries that goes beyond simple word similarity. The study found that while LLMs achieve high linguistic similarity to human outputs, they significantly underperform in semantic accuracy and capturing contextual meanings.

AI · Neutral · arXiv – CS AI · Mar 3 · 6/10

MC-Search: Evaluating and Enhancing Multimodal Agentic Search with Structured Long Reasoning Chains

Researchers introduce MC-Search, the first benchmark for evaluating agentic multimodal retrieval-augmented generation (MM-RAG) systems with long, structured reasoning chains. The benchmark reveals systematic issues in current multimodal large language models and introduces Search-Align, a training framework that improves planning and retrieval accuracy.

AI · Neutral · arXiv – CS AI · Mar 3 · 6/10

Benchmarking LLM Summaries of Multimodal Clinical Time Series for Remote Monitoring

Researchers developed an event-based evaluation framework for LLM-generated clinical summaries of remote monitoring data, revealing that models with high semantic similarity often fail to capture clinically significant events. A vision-based approach using time-series visualizations achieved the best clinical event alignment with 45.7% abnormality recall.
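As a rough illustration of what an event-level recall figure measures, the sketch below computes the fraction of reference abnormal events that a summary mentions. Representing events as string labels, the function name, and the sample labels are all hypothetical; the paper's actual event matching is not described in this summary.

```python
def abnormality_recall(reference_events: set[str], summary_events: set[str]) -> float:
    """Fraction of reference abnormal events that also appear in the summary."""
    if not reference_events:
        raise ValueError("reference_events must not be empty")
    matched = reference_events & summary_events
    return len(matched) / len(reference_events)


# Hypothetical labels: the summary captures 2 of the 4 reference abnormalities.
reference = {"tachycardia", "hypoxia", "fever", "arrhythmia"}
summary = {"tachycardia", "fever", "normal sleep"}
print(abnormality_recall(reference, summary))
```

A summary can read as semantically close to a reference and still score low here, because recall only credits the abnormal events it actually reports.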

AI · Neutral · arXiv – CS AI · Feb 27 · 6/10

SPM-Bench: Benchmarking Large Language Models for Scanning Probe Microscopy

Researchers have developed SPM-Bench, a PhD-level benchmark for testing large language models on scanning probe microscopy tasks. The benchmark uses automated data synthesis from scientific papers and introduces new evaluation metrics to assess AI reasoning capabilities in specialized scientific domains.

AI · Neutral · arXiv – CS AI · Mar 9 · 4/10

Conditioning LLMs to Generate Code-Switched Text

Researchers developed a methodology to fine-tune large language models (LLMs) for generating code-switched text between English and Spanish by back-translating natural code-switched sentences into monolingual English. The study found that fine-tuning significantly improves LLMs' ability to generate fluent code-switched text, and that LLM-based evaluation methods align better with human preferences than traditional metrics.