#benchmark-dataset News & Analysis

83 articles tagged with #benchmark-dataset. AI-curated summaries with sentiment analysis and key takeaways from 50+ sources.

AI · Bullish · arXiv – CS AI · Jun 23 · 7/10

Counsel: A Meta-Evaluation Dataset for Agentic Tasks

Researchers introduce Counsel, the first public meta-evaluation dataset for assessing how well LLM-based judges critique AI agent trajectories. The dataset addresses a critical bottleneck in agent evaluation by providing human-validated assessments of automated critique quality, enabling better calibration of evaluators at scale.

AI · Bullish · arXiv – CS AI · Jun 23 · 7/10

ConnectomeBench2: A Unified Benchmark for Automated Connectomic Proofreading

Researchers released ConnectomeBench2, a unified benchmark dataset containing over 716,000 expert-labeled proofreading decisions for automated 3D brain reconstruction across four species. A Vision Transformer model trained on this dataset achieved human-level accuracy in identifying segmentation errors, advancing the automation of connectomic proofreading—a critical bottleneck in neuroscience research.

🏢 Hugging Face

AI · Bullish · arXiv – CS AI · Jun 10 · 7/10

MMClima: A Framework for Multimodal Climate Science Data and Evaluation

Researchers introduce MMClima, a large-scale multimodal framework containing 104k+ expert-validated QA pairs for climate science across text, video, and figures. The project benchmarks state-of-the-art multimodal AI models and releases a fine-tuned baseline model, evaluation tools, and dataset to standardize climate science AI evaluation.

AI · Bullish · arXiv – CS AI · Jun 9 · 7/10

MedVision: Benchmarking Quantitative Medical Image Analysis

Researchers introduce MedVision, a large-scale benchmark dataset with 30.8 million image-annotation pairs designed to evaluate and improve vision-language models (VLMs) on quantitative medical image analysis tasks. The work demonstrates that current VLMs perform poorly on clinical quantitative reasoning—such as tumor measurement and joint angle assessment—but can be significantly improved through supervised and reinforcement fine-tuning.

AI · Bullish · arXiv – CS AI · Jun 5 · 7/10

GenTI: Benchmarking LLMs for Autonomous IDPS Rule Generation for Unseen Attacks

Researchers introduce GenTI, an LLM-driven framework that automatically generates intrusion detection and prevention system (IDPS) rules for zero-day and unseen attacks. The benchmark dataset aggregates over 150,000 Snort/Suricata rules and 50,000 YARA signatures with structured cybersecurity intelligence. Reported results are 87.4% detection accuracy on unseen threats and a drop in false positives from 8.5% to 2.3%.

AI · Neutral · arXiv – CS AI · Jun 4 · 7/10

CounterFace: A Synthetic Face Dataset for Fine-Grained Counterfactual Evaluation of Face Recognition Systems

Researchers introduce CounterFace, a synthetic face dataset with 11,821 counterfactual face pairs designed to evaluate face recognition systems across 20 facial attributes and 8 demographic factors. The fully automated pipeline addresses limitations in existing benchmarks by enabling fine-grained robustness testing across appearance variations like hairstyles and makeup, revealing significant performance disparities across commercial and open-source FR systems.

AI · Bullish · arXiv – CS AI · Jun 2 · 7/10

FVSpec: Real-World Property-Based Tests as Lean Challenges

Researchers have created FVSpec, a benchmark dataset of 9,415 Lean 4 formal specifications derived from 2,772 real-world Python property-based tests, designed to evaluate AI models on automated formal software verification tasks. The work addresses a critical gap in AI-assisted code verification by providing open-source tools and data to advance AI's capability to formally prove software correctness.

AI · Bullish · arXiv – CS AI · May 29 · 7/10

JAEGER: Joint 3D Audio-Visual Grounding and Reasoning in Simulated Physical Environments

JAEGER is a new AI framework that extends audio-visual large language models from 2D to 3D space, enabling spatial grounding and reasoning in physical environments through RGB-D observations and multi-channel audio. The researchers introduce Neural Intensity Vector (Neural IV) for enhanced directional audio analysis and release SpatialSceneQA, a 61k-sample benchmark for training and evaluation.

AI · Bullish · arXiv – CS AI · May 29 · 7/10

VitalAgent: A Tool-Augmented Agent for Reactive and Proactive Physiological Monitoring over Wearable Health Data

Researchers introduce VitalAgent, an AI framework that combines language models with tool-augmented reasoning to enable both reactive question answering and proactive monitoring of physiological data from wearable devices like ECG and PPG sensors. The framework achieves a 30% improvement over baseline approaches and is validated against a new benchmark dataset (VitalBench) containing 1,862 QA pairs and 90+ hours of continuous biometric recordings.

AI · Bearish · arXiv – CS AI · May 28 · 7/10

Deepfake-Eval-2024: A Multi-Modal In-the-Wild Benchmark of Deepfakes Circulated in 2024

Researchers introduce Deepfake-Eval-2024, a new benchmark dataset of real-world deepfakes collected from social media in 2024, revealing that state-of-the-art detection models experience dramatic performance drops of 45-50% compared to academic benchmarks. The findings underscore a critical gap between laboratory-validated deepfake detectors and their effectiveness against actual manipulated content in circulation.

AI · Neutral · arXiv – CS AI · May 1 · 7/10

NanoKnow: How to Know What Your Language Model Knows

Researchers release NanoKnow, a benchmark dataset that reveals how large language models acquire and encode knowledge by leveraging nanochat's fully transparent pre-training data. The study demonstrates that LLM accuracy depends heavily on answer frequency in training data, and that parametric knowledge and external evidence serve complementary roles in model outputs.

AI · Neutral · arXiv – CS AI · Apr 14 · 7/10

Can Large Language Models Infer Causal Relationships from Real-World Text?

Researchers developed the first real-world benchmark for evaluating whether large language models can infer causal relationships from complex academic texts. The study reveals that LLMs struggle significantly with this task, with the best models achieving an F1 score of only 0.535, highlighting a critical gap in AI reasoning capabilities needed for AGI advancement.
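
For readers unfamiliar with the metric, F1 is the harmonic mean of precision and recall. The sketch below shows the basic single-class definition only. The summary does not say how the paper aggregates F1 (for example micro or macro averaging), so the function name `f1_score` and the counts used here are illustrative assumptions, not the paper's evaluation code.

```python
def f1_score(true_positives: int, false_positives: int, false_negatives: int) -> float:
    """Basic single-class F1: harmonic mean of precision and recall.

    Returns 0.0 when precision or recall is undefined or both are zero.
    The inputs are hypothetical counts, not figures from any paper.
    """
    predicted = true_positives + false_positives   # everything the model flagged
    actual = true_positives + false_negatives      # everything that is truly positive
    if predicted == 0 or actual == 0:
        return 0.0
    precision = true_positives / predicted
    recall = true_positives / actual
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Invented example: 50 correct relations found, 50 spurious, 50 missed.
example = f1_score(true_positives=50, false_positives=50, false_negatives=50)
print(f"F1 = {example:.3f}")
```

With these invented counts, precision and recall are both 0.5, so F1 is also 0.5. A score near 0.535 therefore means a model is, roughly speaking, wrong or incomplete about as often as it is right.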

AI · Neutral · arXiv – CS AI · Feb 27 · 7/10

MM-NeuroOnco: A Multimodal Benchmark and Instruction Dataset for MRI-Based Brain Tumor Diagnosis

Researchers introduce MM-NeuroOnco, a large-scale multimodal dataset containing 24,726 MRI slices and 200,000 instructions for training AI models in brain tumor diagnosis. The benchmark reveals significant challenges in medical AI, with even advanced models like Gemini 3 Flash achieving only 41.88% accuracy on diagnostic questions.

AI · Neutral · arXiv – CS AI · Jun 25 · 6/10

Sarashina2.2-TTS: Tackling Kanji Polyphony in Japanese Speech Generation via Data Scaling and Targeted Data Synthesis

Researchers introduce Sarashina2.2-TTS, a Japanese-focused text-to-speech system trained on 361k hours of speech that addresses kanji polyphony challenges through scaled training and targeted data augmentation. The system achieves state-of-the-art performance on Japanese pronunciation while maintaining cross-lingual robustness, alongside a new benchmark for evaluating kanji reading accuracy.

AI · Neutral · arXiv – CS AI · Jun 23 · 5/10

Sarc7: Evaluating Sarcasm Detection and Generation with Seven Types and Emotion-Informed Techniques

Researchers introduce Sarc7, a benchmark dataset for classifying seven types of sarcasm using large language models, with a novel emotion-based prompting technique that outperforms traditional zero-shot and few-shot approaches. The study demonstrates that Gemini 2.5 achieved the highest performance with an F1 score of 0.3664, while emotion-informed generation methods showed a 38.46% improvement in human evaluation over baseline approaches.

🧠 Gemini

AI · Bullish · arXiv – CS AI · Jun 23 · 6/10

Learning Bug Context for PyTorch-to-JAX Translation with LLMs

Researchers introduce T2J, a benchmark dataset of PyTorch-to-JAX translation bugs paired with developer fixes, addressing the challenge of translating deep-learning code between frameworks. By supplying this curated bug-fix data to LLMs through in-context learning, they achieve up to a 20% improvement in translation accuracy, demonstrating that domain-specific bug datasets can significantly enhance code generation reliability.

🧠 GPT-4

AI · Neutral · arXiv – CS AI · Jun 23 · 6/10

Chem2Gen-Bench: Benchmarking Chemical-to-Genetic Translation in Perturbation Response Space

Researchers introduce Chem2Gen-Bench, a comprehensive benchmark dataset containing over 1.3 million chemical and genetic perturbation profiles designed to evaluate how accurately computational models can translate chemical perturbations into genetic responses. The study reveals that while translation between these perturbation types is measurable, it remains heterogeneous across different cellular contexts, and current foundation-model embeddings don't consistently outperform simpler baseline approaches.

AI · Neutral · arXiv – CS AI · Jun 23 · 6/10

Fine-Grained Uncertainty Quantification for Long-Form Language Model Outputs: A Comparative Study

Researchers introduce a comprehensive framework for detecting hallucinations in long-form language model outputs through fine-grained uncertainty quantification, finding that simpler claim-level consistency methods outperform complex alternatives. The study provides practical guidance for improving factuality in extended LLM generations across STEM and geography domains.

AI · Neutral · arXiv – CS AI · Jun 23 · 6/10

CAOA -- Completion-Assisted Object-CAD Alignment

Researchers introduce CAOA, a method for aligning CAD models to real-world objects in 3D indoor scans by combining point cloud completion with symmetry-aware pose estimation. The approach achieves a 17% accuracy improvement over existing methods and introduces S2C-Completion, a new benchmark dataset of 8,500+ annotated object-CAD pairs for advancing 3D reconstruction tasks.

AI · Neutral · arXiv – CS AI · Jun 23 · 6/10

MotionHalluc: Diagnosing Kinematic Hallucinations in Fine-Grained Motion Reasoning

Researchers introduce MotionHalluc, a benchmark dataset for evaluating how AI models hallucinate when analyzing motion differences between paired videos. The study reveals that large multimodal models struggle with directional, attributional, and temporal hallucinations in motion reasoning, but shows that injecting explicit kinematic measurements can improve accuracy by 10.6%.

AI · Bullish · arXiv – CS AI · Jun 23 · 6/10

Efficient Multimodal Clinical Question Answering for Pulmonary Embolism Risk Assessment

Researchers have developed a benchmark for evaluating efficient multimodal language models on pulmonary embolism diagnosis and risk assessment using a dataset of 23,248 CTPA studies. The study demonstrates that compact models like Gemma4 perform significantly better when combining imaging and electronic health record data, with diagnostic tasks outperforming prognostic predictions.

AI · Neutral · arXiv – CS AI · Jun 19 · 6/10

NRITYAM: Language Models Meet Art and Heritage of Dance

Researchers have introduced NRITYAM, a comprehensive multilingual benchmark dataset containing 9,260 question-answer pairs across 12 languages designed to evaluate how well language models understand global dance traditions and cultural heritage. Developed in collaboration with native dance artists and speakers, the dataset addresses a critical gap in AI evaluation by testing cultural comprehension beyond Western-centric knowledge, establishing new standards for assessing AI systems' ability to reason about traditional performing arts.

AI · Neutral · arXiv – CS AI · Jun 11 · 6/10

BioDivergence: A Benchmark and Evaluation Framework for Hidden Contextual Contradictions in Biomedical Abstracts

Researchers introduce BioDivergence, a new evaluation framework that distinguishes between genuine contradictions and context-dependent divergences in biomedical research claims. The framework includes a six-class taxonomy and 13-axis ontology to capture why studies produce seemingly conflicting results, with a released benchmark of 11,865 claim pairs showing that current NLI models struggle with contextual understanding.

AI · Neutral · arXiv – CS AI · Jun 10 · 6/10

ASyMOB: Algebraic Symbolic Mathematical Operations Benchmark

Researchers introduce ASyMOB, a 35,368-problem benchmark dataset for evaluating large language models on symbolic mathematics tasks. The dataset uses systematic perturbations to test genuine reasoning rather than pattern memorization, revealing that most models fail under minor problem variations while hybrid LLM-computer algebra system approaches show promise for scientific computing applications.

AI · Neutral · arXiv – CS AI · Jun 10 · 6/10

Expert-Level Crisis Detection in Mental Health Conversations

Researchers introduce CRADLE-Dialogue, a clinician-annotated benchmark dataset with 600 dialogues for detecting mental health crises in real-time conversations. The study reveals that identifying when risk emerges in multi-turn dialogues is significantly harder than recognizing that risk exists, with models achieving F1 scores of only 40-60%. The researchers also release a 32B-parameter model competitive with proprietary alternatives.
