16 articles tagged with #benchmark-dataset. AI-curated summaries with sentiment analysis and key takeaways from 50+ sources.
AI · Neutral · arXiv → CS AI · 3d ago · 7/10
🧠 Researchers developed the first real-world benchmark for evaluating whether large language models can infer causal relationships from complex academic texts. The study reveals that LLMs struggle significantly with this task: the best model achieves an F1 score of only 0.535, highlighting a critical gap in the causal-reasoning capabilities needed for AGI advancement.
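For context on the reported metric, F1 is the harmonic mean of precision and recall over predicted causal links. A minimal sketch (the counts below are illustrative values chosen to reproduce a 0.535 score, not figures from the paper):

```python
def f1_score(tp: int, fp: int, fn: int) -> float:
    """Harmonic mean of precision and recall from raw counts."""
    precision = tp / (tp + fp)  # fraction of predicted links that are correct
    recall = tp / (tp + fn)     # fraction of true links that were recovered
    return 2 * precision * recall / (precision + recall)

# Hypothetical: 107 correct links, 93 spurious, 93 missed
print(round(f1_score(107, 93, 93), 3))  # 0.535
```

At 0.535, a model is recovering barely half of the true causal structure even when precision and recall are balanced.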
AI · Neutral · arXiv → CS AI · Feb 27 · 7/10
🧠 Researchers introduce MM-NeuroOnco, a large-scale multimodal dataset containing 24,726 MRI slices and 200,000 instructions for training AI models in brain tumor diagnosis. The benchmark reveals significant challenges in medical AI, with even advanced models like Gemini 3 Flash achieving only 41.88% accuracy on diagnostic questions.
AI · Bullish · arXiv → CS AI · 3d ago · 6/10
🧠 Researchers introduced MMR-AD, a large-scale multimodal dataset designed to benchmark general anomaly detection using Multimodal Large Language Models (MLLMs). The study reveals that current state-of-the-art MLLMs fall short of industrial requirements for anomaly detection, though a proposed baseline model called Anomaly-R1 demonstrates significant improvements through reasoning-based approaches enhanced by reinforcement learning.
AI · Neutral · arXiv → CS AI · 3d ago · 6/10
🧠 Researchers have introduced C-ReD, a Chinese benchmark dataset for detecting AI-generated text that addresses gaps in model diversity and data homogeneity. The dataset, derived from real-world prompts, demonstrates reliable in-domain detection and strong generalization to unseen language models, with resources publicly available on GitHub.
AI · Neutral · arXiv → CS AI · 4d ago · 6/10
🧠 Researchers propose Noise-Aware In-Context Learning (NAICL), a plug-and-play method to reduce hallucinations in auditory large language models without expensive fine-tuning. The approach uses a noise prior library to guide models toward more conservative outputs, achieving a 37% reduction in hallucination rates while establishing a new benchmark for evaluating audio understanding systems.
AI · Neutral · arXiv → CS AI · Apr 10 · 6/10
🧠 Researchers introduce A-MBER, a benchmark dataset designed to evaluate AI assistants' ability to recognize emotions based on long-term interaction history rather than immediate context. The benchmark tests whether models can retrieve relevant past interactions, infer current emotional states, and provide grounded explanations, revealing that memory's value lies in selective, context-aware interpretation rather than simple historical volume.
AI · Neutral · arXiv → CS AI · Apr 10 · 6/10
🧠 Researchers evaluated whether large language models understand long-form narratives similarly to humans by comparing summaries of 150 novels written by humans and nine state-of-the-art LLMs. The study found that LLMs focus disproportionately on story endings rather than distributing attention like human readers, revealing gaps in narrative comprehension despite expanded context windows.
AI · Neutral · arXiv → CS AI · Apr 10 · 6/10
🧠 Researchers introduced a new benchmark dataset for evaluating world models' ability to maintain spatial consistency across long sequences, addressing a critical gap in AI evaluation. The dataset, collected from Minecraft environments with 20 million frames across 150 locations, enables development of memory-augmented models that can reliably simulate physical spaces for downstream tasks like planning and simulation.
AI · Neutral · arXiv → CS AI · Apr 10 · 6/10
🧠 Q-Probe introduces a novel agentic framework for scaling image quality assessment to high-resolution images by addressing limitations in existing reinforcement learning approaches. The research presents Vista-Bench, a new benchmark for fine-grained degradation analysis, and demonstrates state-of-the-art performance across multiple resolution scales through context-aware probing mechanisms.
AI · Neutral · arXiv → CS AI · Mar 16 · 6/10
🧠 Researchers discovered that large language models exhibit gender bias at the individual question level, creating different amounts of information for men versus women despite appearing unbiased at category levels. A new benchmark dataset called RealWorldQuestioning was developed, and a simple prompt-based debiasing approach was shown to improve response quality in 78% of cases.
AI · Bullish · arXiv → CS AI · Mar 3 · 6/10
🧠 Researchers introduce TripleSumm, a novel AI architecture that adaptively fuses visual, text, and audio modalities for improved video summarization. The team also releases MoSu, the first large-scale benchmark dataset providing all three modalities for multimodal video summarization research.
AI · Bullish · arXiv → CS AI · Mar 3 · 6/10
🧠 Researchers propose MOON, the first generative multimodal large language model designed specifically for e-commerce product understanding. The model addresses key challenges in product representation learning through guided Mixture-of-Experts modules and semantic region detection, while introducing a new benchmark dataset for evaluation.
AI · Neutral · arXiv → CS AI · Feb 27 · 5/10
🧠 Researchers introduce FIRE, a comprehensive benchmark for evaluating Large Language Models' financial intelligence and reasoning capabilities. The benchmark includes theoretical financial knowledge tests from qualification exams and 3,000 practical financial scenario questions covering complex business domains.
AI · Neutral · arXiv → CS AI · Feb 27 · 6/10
🧠 Researchers introduce PoSh, a new evaluation metric for detailed image descriptions that uses scene graphs to guide LLMs-as-a-Judge, achieving better correlation with human judgments than existing methods. They also present DOCENT, a challenging benchmark dataset featuring artwork with expert-written descriptions to evaluate vision-language models' performance on complex image analysis.
AI · Neutral · arXiv → CS AI · Mar 4 · 4/10
🧠 Researchers introduce Whisper-RIR-Mega, a new benchmark dataset for testing automatic speech recognition robustness in reverberant acoustic environments. The study evaluates five Whisper models and finds that reverberation consistently degrades performance across all model sizes, with word error rates increasing by 0.12 to 1.07 percentage points.
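The word error rate cited above is the standard ASR metric: word-level edit distance between reference and hypothesis transcripts, divided by the reference length. A minimal sketch with an illustrative transcript pair (not data from the paper):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: Levenshtein distance over words / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # Dynamic-programming edit distance over word sequences
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i  # cost of deleting i reference words
    for j in range(len(hyp) + 1):
        d[0][j] = j  # cost of inserting j hypothesis words
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + sub)  # match / substitution
    return d[len(ref)][len(hyp)] / len(ref)

# One dropped word out of six: 1/6 ≈ 0.167, i.e. 16.7% WER
print(wer("the cat sat on the mat", "the cat sat on mat"))
```

On this scale, the reported degradations of 0.12 to 1.07 percentage points are small absolute shifts in WER, but consistent across all five model sizes.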
AI · Neutral · arXiv → CS AI · Mar 4 · 4/10
🧠 Researchers developed new prompting-based approaches using multimodal large language models to generate real-time video commentary that considers both content relevance and timing. The study introduces dynamic interval-based decoding that adjusts prediction timing based on utterance duration, showing improved alignment with human commentary patterns without requiring model fine-tuning.