#benchmark-dataset News & Analysis

83 articles tagged with #benchmark-dataset. AI-curated summaries with sentiment analysis and key takeaways from 50+ sources.

AI · Neutral · arXiv – CS AI · Jun 10 · 6/10

Expert-Level Crisis Detection in Mental Health Conversations

Researchers introduce CRADLE-Dialogue, a clinician-annotated benchmark dataset of 600 dialogues for detecting mental health crises in real-time conversations. The study finds that identifying when risk emerges in a multi-turn dialogue is significantly harder than recognizing that risk exists, with models achieving F1 scores of only 40-60%. The authors also release a 32B-parameter model that is competitive with proprietary alternatives.

AI · Bullish · arXiv – CS AI · Jun 9 · 6/10

MedicalRec: Medical recommender system for image classification without retraining

Researchers have developed MedicalRec, a transformer-based recommender system that identifies optimal deep learning models for medical image classification tasks without requiring retraining. The system leverages a new dataset (MedicalRec-Bench) containing over 5,000 model performance records across five medical imaging domains, achieving a 75.5% HitRate@100 and addressing the computational waste inherent in trial-and-error model selection.

AI · Neutral · arXiv – CS AI · Jun 9 · 6/10

EgoTactile: Learning Grasp Pressure for Everyday Objects from Egocentric Video

Researchers introduce EgoTactile, a new benchmark and AI framework for estimating hand grasp pressure from egocentric video without intrusive hardware sensors. The work combines vision-based deep learning with diffusion models to infer tactile information for VR and robotic applications, achieving strong generalization to real-world scenarios.

AI · Neutral · arXiv – CS AI · Jun 9 · 6/10

HA-VLN 2.0: An Open Benchmark and Leaderboard for Human-Aware Navigation in Discrete and Continuous Environments with Dynamic Multi-Human Interactions

Researchers introduce HA-VLN 2.0, a benchmark for vision-and-language navigation that explicitly incorporates human-aware constraints in both discrete and continuous environments. The study reveals significant performance degradation in leading navigation agents when confronted with dynamic multi-human interactions, emphasizing the critical need for social-awareness modeling in autonomous navigation systems.

AI · Neutral · arXiv – CS AI · Jun 9 · 6/10

XCR-Bench: Benchmarking Cross-Cultural Reasoning in LLMs via Culture-Specific Items and Hall's Triad

Researchers introduce XCR-Bench, a benchmark dataset for evaluating cross-cultural reasoning in large language models, containing 4,100 parallel sentences and 1,098 culture-specific items across three reasoning tasks. The study reveals that state-of-the-art multilingual LLMs consistently fail to properly identify and adapt culturally sensitive content, exposing systematic biases and gaps in cultural competency.

AI · Neutral · arXiv – CS AI · Jun 8 · 6/10

TSAQA: Time Series Analysis Question And Answering Benchmark

Researchers introduce TSAQA, a comprehensive benchmark for evaluating time series analysis capabilities in large language models across six diverse tasks and 210k samples. Current LLMs struggle significantly with temporal analysis, with even top commercial models achieving only 65% accuracy, revealing substantial gaps in their ability to handle complex time series reasoning.

🧠 Gemini

AI · Neutral · arXiv – CS AI · Jun 5 · 6/10

I Know What You Meme, Even If it Emerged Today: Understanding Evolving Memes through Open-World Knowledge Acquisition

Researchers introduce Query Retrieve Conclude, a zero-shot framework that improves meme understanding by identifying knowledge gaps, retrieving current web evidence, and synthesizing grounded background knowledge. The approach addresses limitations of existing methods that rely on outdated or incomplete parametric knowledge, demonstrating improvements across meme understanding and detection tasks using a new benchmark dataset of 2024-2026 memes.

AI · Neutral · arXiv – CS AI · Jun 5 · 6/10

DisasterBench: A Multimodal Benchmark for UAV-Based Disaster Response in Complex Environments

Researchers introduced DisasterBench, a multimodal AI benchmark designed to improve UAV-based disaster response by testing reasoning across 14 disaster types and 9 response-critical tasks. They also developed DisasterVL, a lightweight 2B-parameter model that achieves GPT-4o-level reasoning accuracy while operating efficiently on edge devices with limited computational resources.

🧠 GPT-4

AI · Neutral · arXiv – CS AI · Jun 5 · 6/10

Benchmarking Open-Source Layout Detection Models for Data Snapshot Extraction from Institutional Documents

Researchers have developed a benchmark dataset and evaluation framework for extracting data snapshots (figures and tables) from institutional documents like World Bank reports. The study reveals that current open-source layout detection models fail to generalize effectively to operational documents, struggling to distinguish analytical from non-analytical content and often fragmenting composite visual artifacts.

🏢 Hugging Face

AI · Neutral · arXiv – CS AI · Jun 4 · 6/10

LCSHBench: A Multilingual, Consensus-Grounded Benchmark for Library of Congress Subject Heading Assignment

LCSHBench introduces the first large-scale public benchmark for Library of Congress Subject Heading assignment, comprising 22,346 multilingual books with consensus-validated labels from three major university libraries. The dataset reveals that while libraries agree on conceptual topics 93% of the time, they differ in exact heading assignments 39.4% of the time, enabling more nuanced evaluation of automated cataloging systems.

AI · Neutral · arXiv – CS AI · Jun 2 · 6/10

TECCI: Tricky Edits of Collected and Curated Images

Researchers introduce TECCI, a new benchmark dataset for evaluating text-guided image editing models, containing 7,550 image-instruction pairs across challenging edit types. Human evaluations reveal that leading image editors achieve a success rate of only 22%, struggling most with spatial reasoning and creative edits while excelling at color adjustments.

🧠 Gemini

AI · Neutral · arXiv – CS AI · Jun 2 · 6/10

HAIM: Human-AI Music Datasets for AI Music Production Tracking Benchmark

Researchers introduce HAIM, a new dataset and benchmark for detecting AI integration across music production workflows, moving beyond binary AI-or-human classification to track granular stages of AI intervention including hybrid and mastered content. The work exposes critical limitations in current AI detection systems as generative music platforms like Suno and Udio achieve human-quality output.

AI · Neutral · arXiv – CS AI · Jun 2 · 6/10

Multilingual Idioms in Sentences and Conversations Across High-, Medium-, and Low-Resource Languages

Researchers introduce MIDI, a multilingual idiom dataset covering 18 languages across resource tiers, revealing that state-of-the-art NLP models struggle significantly with idiomatic expressions—particularly in low-resource languages and when interpreting literal meanings. The findings expose fundamental gaps in how current AI systems handle contextual language nuance across different linguistic communities.

AI · Bullish · arXiv – CS AI · Jun 2 · 6/10

MGRegBench: A Novel Benchmark Dataset with Anatomical Landmarks for Mammography Image Registration

Researchers have released MGRegBench, the first large-scale public dataset for mammography image registration with over 5,000 image pairs and 100 manually annotated landmarks. This addresses a critical gap in medical AI research by enabling standardized, reproducible benchmarking of registration methods across classical, learning-based, and deep learning approaches.

🏢 Meta

AI · Bullish · arXiv – CS AI · Jun 2 · 6/10

Multimodal Music Recommendation System using LLMs

Researchers propose a multimodal music recommendation system that enriches collaborative filtering with audio embeddings, lyric analysis, and LLM-generated semantic metadata. The framework demonstrates significant performance improvements over traditional ID-only baselines, with recall gains of up to 95%, while revealing that naive multimodal fusion presents integration challenges.

AI · Neutral · arXiv – CS AI · Jun 2 · 6/10

SkyShield: Occupancy as a Safety Interface for Low-Altitude UAV Autonomy

Researchers introduce SkyShield, the first monocular semantic occupancy benchmark for low-altitude UAV autonomy below 20 meters, addressing a critical gap in aerial safety perception. The dataset includes 36K annotated samples with 6-DoF pose tracking and a new safety-aware evaluation metric (KAR-mIoU) that prioritizes collision-critical risks over traditional accuracy measures.

AI · Neutral · arXiv – CS AI · Jun 2 · 6/10

An Open-Source Benchmark and Baseline for Multi-temporal Referring Segmentation

Researchers introduce Multi-temporal Referring Segmentation (MTRS), a new computer vision task that combines temporal reasoning with language-guided image segmentation. They create MTRefSeg-21K, the first benchmark dataset with 21,000 annotated image triplets, and develop MTRefSeg-R1, an LVLM framework that outperforms existing models by learning temporal-change perception before fine-tuning on language-grounded tasks.

AI · Bullish · arXiv – CS AI · Jun 1 · 6/10

Learning to Adapt: Self-Improving Web Agent via Cognitive-Aware Exploration

Researchers introduce SCALE, a self-improving web agent framework that uses adversarial roles and cognitive-aware exploration to autonomously adapt to complex web environments without relying on handcrafted pipelines or expensive expert data. The framework includes SCALE-Hop, a graph exploration strategy, and SCALE-20k, a 20,000-sample dataset from 19 real-world websites that demonstrates improved performance across multiple multimodal large language models.

AI · Neutral · arXiv – CS AI · Jun 1 · 6/10

ReTabAD: A Benchmark for Restoring Semantic Context in Tabular Anomaly Detection

ReTabAD introduces a new benchmark dataset for tabular anomaly detection that incorporates semantic context through textual metadata, addressing a gap where existing datasets lack domain knowledge. The research provides 20 enriched datasets, implementations of classical and LLM-based detection algorithms, and demonstrates that semantic context improves both detection performance and interpretability.

AI · Neutral · arXiv – CS AI · May 29 · 6/10

From XXLTraffic to EvoXXLTraffic: Scaling Traffic Forecasting to Sensor-Evolving Networks

Researchers introduce XXLTraffic and EvoXXLTraffic, new datasets spanning 27 years of California and Australian traffic sensor data that account for real-world network growth. Unlike existing benchmarks assuming fixed sensor networks, these datasets expose the challenge of forecasting across dynamically evolving road infrastructure with sensor growth rates exceeding 10,000%, and reveal that current state-of-the-art models fail to generalize under such conditions.

AI · Neutral · arXiv – CS AI · May 29 · 6/10

Brain-IT-VQA: From Brain Signals to Answers

Researchers have developed Brain-IT-VQA, a framework that decodes visual question answers directly from fMRI brain signals with significantly improved accuracy over previous methods. The team also introduced NSD-VQA, a new benchmark dataset with 20 controlled question categories per image, enabling more reliable evaluation of how visual information is represented in the brain.

AI · Neutral · arXiv – CS AI · May 29 · 6/10

JMed48k: A Multi-Profession Japanese Medical Licensing Benchmark for Vision-Language Model Evaluation

Researchers introduce JMed48k, a comprehensive Japanese medical licensing benchmark containing 48,862 exam questions and 20,142 images to evaluate vision-language models across 11 healthcare professions. Testing 21 models reveals significant disparities in how effectively different AI systems leverage visual information, with proprietary models gaining substantially from images while medical-specific systems show limited visual utilization.

AI · Neutral · arXiv – CS AI · May 28 · 6/10

SuiChat-CN: Benchmarking Contextual Suicide Risk Assessment in Chinese Group Chats

Researchers introduce SuiChat-CN, a Chinese-language benchmark dataset for assessing suicide risk in group chat conversations using AI models. The dataset contains 13,312 contextual segments from Telegram, demonstrating that contextual information significantly improves risk detection accuracy compared to isolated message analysis.

AI · Bullish · arXiv – CS AI · May 28 · 6/10

BuddyBench: A Privacy-Constrained Multi-Task Benchmark for Pediatric Social-Communication Personalization

BuddyBench introduces a privacy-protected multi-task benchmark dataset combining clinical assessments, learning trajectories, and treatment outcomes for pediatric social-communication research. The dataset integrates two cohorts (189 observational and 86 randomized controlled trial participants) to enable knowledge tracing, clinical prediction, and causal inference while maintaining pediatric data protection standards.

AI · Neutral · arXiv – CS AI · May 28 · 5/10

ChildEval: When large language models meet children's personalities

Researchers introduce ChildEval, a benchmark dataset containing 29K synthesized persona profiles for evaluating how well large language models understand and respond to the preferences of children aged 3-6. The work addresses a gap in LLM evaluation by testing whether AI systems can infer and follow child-specific preferences in extended conversations, with results showing that fine-tuning on the benchmark improves child-centered performance.
