#dataset News & Analysis

83 articles tagged with #dataset. AI-curated summaries with sentiment analysis and key takeaways from 50+ sources.

AI · Bullish · arXiv – CS AI · Jun 23 · 7/10

MammoExpert: Benchmarking Chain-of-Thought Reasoning in Mammography Diagnosis

MammoExpert introduces the first large-scale mammography dataset with Chain-of-Thought reasoning annotations, comprising 2,379 images across 67 histopathology subtypes. The dataset demonstrates significant improvements in breast lesion classification accuracy (4-7.1% gains) and provides a benchmark for interpretable AI diagnostic reasoning in medical imaging.

AI · Bullish · arXiv – CS AI · Jun 10 · 7/10

Linguistically Augmented Audio Speech Data (LinguAS)

Researchers introduce LinguAS, a dataset of 800+ audio samples annotated with linguistic features to improve detection of deepfaked and spoofed speech. Models trained on this linguistically-augmented data significantly outperform existing deepfake detection baselines, addressing a critical gap in audio forensics.

AI · Bullish · arXiv – CS AI · Jun 9 · 7/10

Zero-Shot Learning in Industrial Scenarios: New Large-Scale Benchmark, Challenges and Baseline

Researchers introduce MMIO, a large-scale industrial dataset with 80K+ samples, and RTVP, a refined prompt method for zero-shot defect detection in manufacturing. The work addresses the gap between general-purpose Large Visual Language Models and industrial applications, achieving state-of-the-art performance through improved text-visual prompt interactions and domain adaptation.

AI · Bullish · arXiv – CS AI · Jun 9 · 7/10

DIYHealth Suite: Dataset, Model, and Benchmark for Health Management at Home

Researchers introduce DIYHealth Suite, a comprehensive framework including a 900K-sample multimodal dataset, adaptive foundation model, and benchmark for home-based health management powered by generative AI. The framework addresses critical gaps in making healthcare accessible outside clinical settings through standardized tools for diverse home care scenarios.

AI · Bullish · arXiv – CS AI · May 29 · 7/10

AnyMo: Scaling Any-Modality Conditional Motion Generation with Masked Modeling

Researchers introduce AnyMo, a unified framework for conditional human motion generation that supports arbitrary modality combinations (text, speech, music, trajectory). The work is enabled by OmniHuMo, a large-scale dataset of 5,000+ hours of motion with precisely aligned multimodal annotations, addressing the critical bottleneck of training data scarcity in multimodal synthesis.

AI · Bullish · arXiv – CS AI · May 12 · 7/10

FactoryNet: A Large-Scale Dataset toward Industrial Time-Series Foundation Models

Researchers introduce FactoryNet, the first universal pretraining dataset for industrial time-series data containing 51M datapoints across 23k task executions in robotic and machining domains. The dataset employs a novel S-E-F-C schema enabling cross-embodiment transfer and efficient anomaly detection, advancing toward industrial foundation models.

🏢 Meta

AI · Bullish · arXiv – CS AI · May 12 · 7/10

WorldSpeech: A Multilingual Speech Corpus from Around the World

Researchers introduce WorldSpeech, a multilingual speech corpus containing 65,000 hours of aligned audio-transcript data across 76 languages, addressing the critical gap in ASR training data for low-resource languages. Fine-tuning existing ASR models on this dataset achieves an average 63.5% relative Word-Error-Rate reduction, significantly improving speech recognition accuracy for underrepresented languages.

AI · Bullish · arXiv – CS AI · May 11 · 7/10

A Large-Scale Dataset for Molecular Structure-Language Description via a Rule-Regularized Method

Researchers have developed an automated framework to generate a large-scale dataset of 163,000 molecule-description pairs by combining rule-based chemical nomenclature parsing with LLM guidance, achieving 98.6% precision in aligning molecular structures with natural language descriptions. This addresses a critical bottleneck in training language models for chemistry applications where manual annotation is prohibitively expensive.

🏢 Hugging Face

AI · Bullish · arXiv – CS AI · May 11 · 7/10

Video Understanding Reward Modeling: A Robust Benchmark and Performant Reward Models

Researchers introduce Video Understanding Reward Bench (VURB), a comprehensive benchmark with 2,100 preference pairs for evaluating video reward models, alongside VUP-35K, a large-scale dataset of 35,000 preference examples. Two new models, VideoDRM and VideoGRM, achieve state-of-the-art performance on video understanding tasks, advancing multimodal AI capabilities beyond text and images.

AI · Neutral · arXiv – CS AI · May 9 · 7/10

One Turn Too Late: Response-Aware Defense Against Hidden Malicious Intent in Multi-Turn Dialogue

Researchers have developed TurnGate, a defense system that detects multi-turn dialogue attacks where malicious intent is distributed across multiple conversation turns rather than exposed in a single prompt. The study introduces the Multi-Turn Intent Dataset (MTID) and demonstrates that the system outperforms existing baselines while maintaining low false-positive refusal rates.

AI · Neutral · arXiv – CS AI · Apr 7 · 7/10

AgenticFlict: A Large-Scale Dataset of Merge Conflicts in AI Coding Agent Pull Requests on GitHub

Researchers released AgenticFlict, a large-scale dataset analyzing merge conflicts in AI coding agent pull requests on GitHub. The study of 142K+ AI-generated pull requests from 59K+ repositories found a 27.67% conflict rate, highlighting significant integration challenges in AI-assisted software development.

AI · Bullish · arXiv – CS AI · Mar 26 · 7/10

CUA-Suite: Massive Human-annotated Video Demonstrations for Computer-Use Agents

Researchers released CUA-Suite, a comprehensive dataset featuring 55 hours of continuous video demonstrations across 87 desktop applications to train computer-use agents. The dataset addresses a critical bottleneck in developing AI agents that can automate complex desktop workflows, and reveals that current models fail roughly 60% of tasks on professional applications.

AI · Bullish · arXiv – CS AI · Mar 26 · 7/10

DanQing: An Up-to-Date Large-Scale Chinese Vision-Language Pre-training Dataset

Researchers have released DanQing, a large-scale Chinese vision-language dataset containing 100 million high-quality image-text pairs curated from Common Crawl data. The dataset addresses the bottleneck in Chinese VLP development and demonstrates superior performance compared to existing Chinese datasets across various AI tasks.

AI · Neutral · arXiv – CS AI · Mar 11 · 7/10

AI Act Evaluation Benchmark: An Open, Transparent, and Reproducible Evaluation Dataset for NLP and RAG Systems

Researchers have developed an open-source benchmark dataset to evaluate AI systems' compliance with the EU AI Act, specifically focusing on NLP and RAG systems. The dataset enables automated assessment of risk classification, article retrieval, and question-answering tasks, with reported F1 scores of 0.87 and 0.85 for prohibited and high-risk scenarios, respectively.

AI · Bullish · arXiv – CS AI · Mar 5 · 6/10

ToolVQA: A Dataset for Multi-step Reasoning VQA with External Tools

Researchers introduce ToolVQA, a large-scale multimodal dataset with 23K instances designed to improve AI models' ability to use external tools for visual question answering. The dataset features real-world contexts and multi-step reasoning tasks, with fine-tuned 7B models outperforming GPT-3.5-turbo on various benchmarks.

AI · Neutral · arXiv – CS AI · Mar 5 · 7/10

ERDES: A Benchmark Video Dataset for Retinal Detachment and Macular Status Classification in Ocular Ultrasound

Researchers have released ERDES, the first open-access dataset of ocular ultrasound videos for detecting retinal detachment and macular status using machine learning. The dataset addresses a critical gap in automated medical diagnosis by enabling AI models to classify retinal detachment severity, which is essential for determining surgical urgency.

AI · Neutral · arXiv – CS AI · Mar 5 · 6/10

CAM-LDS: Cyber Attack Manifestations for Automatic Interpretation of System Logs and Security Alerts

Researchers introduce CAM-LDS, a new dataset covering 81 cyber attack techniques to improve automated log analysis using Large Language Models. The study shows LLMs can correctly identify attack techniques in about one-third of cases, with adequate performance in another third, demonstrating potential for AI-powered cybersecurity analysis.

AI · Neutral · arXiv – CS AI · Mar 4 · 6/10

AI-Generated Music Detection in Broadcast Monitoring

Researchers introduced AI-OpenBMAT, the first dataset designed for detecting AI-generated music in broadcast environments, revealing that existing detection models perform poorly when music appears as short excerpts or is masked by speech. The study found that state-of-the-art detection models' F1-scores dropped below 60% in challenging broadcast scenarios, highlighting significant limitations in current AI music detection technology.

AI · Bullish · arXiv – CS AI · Mar 3 · 7/10

Robometer: Scaling General-Purpose Robotic Reward Models via Trajectory Comparisons

Researchers introduce Robometer, a new framework for training robot reward models that combines progress tracking with trajectory comparisons to better learn from failed attempts. The system is trained on RBM-1M, a dataset of over one million robot trajectories including failures, and shows improved performance across diverse robotics applications.

AI · Bullish · arXiv – CS AI · Mar 3 · 7/10

Kiwi-Edit: Versatile Video Editing via Instruction and Reference Guidance

Researchers introduce Kiwi-Edit, a new video editing architecture that combines instruction-based and reference-guided editing for more precise visual control. The team created RefVIE, a large-scale dataset for training, and achieved state-of-the-art results in controllable video editing through a unified approach that addresses limitations of natural language descriptions.

AI · Bullish · Hugging Face Blog · Aug 20 · 7/10

NVIDIA Releases 6 Million Multi-Lingual Reasoning Dataset

NVIDIA has released a 6-million-sample multilingual reasoning dataset, a significant contribution to AI research and development. The release could accelerate advances in AI reasoning capabilities across multiple languages and benefit the broader AI research community.

AI · Neutral · arXiv – CS AI · Jun 23 · 6/10

BELDE: Building a Large-scale Earth-observation Land-cover Dataset for Europe

BELDE is a newly introduced large-scale dataset containing over 1 million RGB satellite image-segmentation pairs from Europe, designed to advance earth observation and land-cover segmentation models. The dataset achieves strong in-domain performance (83% F1 score) but reveals significant challenges in cross-geographic generalization, with accuracy dropping substantially on non-European regions.

AI · Neutral · arXiv – CS AI · Jun 23 · 6/10

ToxSyn-PT: A Synthetic Fine-Grained Dataset of Minority-Targeted Toxic Language in Portuguese

Researchers introduce ToxSyn-PT, a large-scale Portuguese dataset for detecting hate speech targeting minority groups, featuring fine-grained annotations and non-toxic counterexamples absent in existing datasets. The study reveals that hate speech detection models trained on social media fail to generalize to minority-specific contexts, exposing critical gaps in current evaluation metrics and highlighting the need for specialized datasets in non-English languages.

🏢 Hugging Face

AI · Neutral · arXiv – CS AI · Jun 23 · 6/10

GAPartManip: A Large-scale Part-centric Dataset for Material-Agnostic Articulated Object Manipulation

Researchers have developed GAPartManip, a large-scale dataset for training AI systems to manipulate articulated household objects by focusing on part-centric interactions rather than traditional depth perception. The dataset includes photo-realistic material variations and detailed annotations for interaction poses, demonstrating improved performance in both simulated and real-world robotic manipulation tasks.

AI · Bullish · arXiv – CS AI · Jun 11 · 6/10

System Report for CCL25-Eval Task 5: New Dataset and LoRA-Fine-Tuned Qwen2.5

Researchers have developed PoetryQwen, a specialized language model fine-tuned for classical Chinese poetry analysis, along with a new 49,404-pair dataset called CCPoetry-49K. The model achieves a 9.7% performance improvement over the baseline Qwen2.5, demonstrating the effectiveness of domain-specific optimization for nuanced linguistic tasks.
