y0news

#dataset News & Analysis

63 articles tagged with #dataset. AI-curated summaries with sentiment analysis and key takeaways from 50+ sources.

AI · Bullish · arXiv – CS AI · 2d ago · 7/10

AnyMo: Scaling Any-Modality Conditional Motion Generation with Masked Modeling

Researchers introduce AnyMo, a unified framework for conditional human motion generation that supports arbitrary modality combinations (text, speech, music, trajectory). The work is enabled by OmniHuMo, a large-scale dataset of 5,000+ hours of motion with precisely aligned multimodal annotations, addressing the critical bottleneck of training data scarcity in multimodal synthesis.

AI · Bullish · arXiv – CS AI · May 12 · 7/10

FactoryNet: A Large-Scale Dataset toward Industrial Time-Series Foundation Models

Researchers introduce FactoryNet, the first universal pretraining dataset for industrial time-series data containing 51M datapoints across 23k task executions in robotic and machining domains. The dataset employs a novel S-E-F-C schema enabling cross-embodiment transfer and efficient anomaly detection, advancing toward industrial foundation models.

🏢 Meta
AI · Bullish · arXiv – CS AI · May 12 · 7/10

WorldSpeech: A Multilingual Speech Corpus from Around the World

Researchers introduce WorldSpeech, a multilingual speech corpus containing 65,000 hours of aligned audio-transcript data across 76 languages, addressing the critical gap in ASR training data for low-resource languages. Fine-tuning existing ASR models on this dataset achieves an average 63.5% relative Word-Error-Rate reduction, significantly improving speech recognition accuracy for underrepresented languages.
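For readers unfamiliar with the metric, a relative Word-Error-Rate reduction expresses the drop in WER as a fraction of the baseline WER. The sketch below is only an illustration: the function name and the input values are hypothetical and are not taken from the WorldSpeech paper.

```python
def relative_wer_reduction(baseline_wer: float, finetuned_wer: float) -> float:
    """Return the WER reduction as a fraction of the baseline WER."""
    if baseline_wer <= 0:
        raise ValueError("baseline_wer must be positive")
    return (baseline_wer - finetuned_wer) / baseline_wer


# Hypothetical WER values, chosen only to illustrate the arithmetic.
example = relative_wer_reduction(baseline_wer=0.40, finetuned_wer=0.146)
print(f"Relative WER reduction: {example:.1%}")
```

Note that a relative reduction says nothing about the absolute WER: the same percentage can correspond to very different final error rates depending on the baseline.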

AI · Bullish · arXiv – CS AI · May 11 · 7/10

Video Understanding Reward Modeling: A Robust Benchmark and Performant Reward Models

Researchers introduce Video Understanding Reward Bench (VURB), a comprehensive benchmark with 2,100 preference pairs for evaluating video reward models, alongside VUP-35K, a large-scale dataset of 35,000 preference examples. Two new models, VideoDRM and VideoGRM, achieve state-of-the-art performance on video understanding tasks, advancing multimodal AI capabilities beyond text and images.

AI · Bullish · arXiv – CS AI · May 11 · 7/10

A Large-Scale Dataset for Molecular Structure-Language Description via a Rule-Regularized Method

Researchers have developed an automated framework to generate a large-scale dataset of 163,000 molecule-description pairs by combining rule-based chemical nomenclature parsing with LLM guidance, achieving 98.6% precision in aligning molecular structures with natural language descriptions. This addresses a critical bottleneck in training language models for chemistry applications where manual annotation is prohibitively expensive.

🏢 Hugging Face
AI · Neutral · arXiv – CS AI · May 9 · 7/10

One Turn Too Late: Response-Aware Defense Against Hidden Malicious Intent in Multi-Turn Dialogue

Researchers have developed TurnGate, a defense system that detects multi-turn dialogue attacks where malicious intent is distributed across multiple conversation turns rather than exposed in a single prompt. The study introduces the Multi-Turn Intent Dataset (MTID) and demonstrates that the system outperforms existing baselines while maintaining low false-positive refusal rates.

AI · Bullish · arXiv – CS AI · Mar 26 · 7/10

DanQing: An Up-to-Date Large-Scale Chinese Vision-Language Pre-training Dataset

Researchers have released DanQing, a large-scale Chinese vision-language dataset containing 100 million high-quality image-text pairs curated from Common Crawl data. The dataset addresses the bottleneck in Chinese VLP development and demonstrates superior performance compared to existing Chinese datasets across various AI tasks.

AI · Bullish · arXiv – CS AI · Mar 26 · 7/10

CUA-Suite: Massive Human-annotated Video Demonstrations for Computer-Use Agents

Researchers released CUA-Suite, a comprehensive dataset featuring 55 hours of continuous video demonstrations across 87 desktop applications to train computer-use agents. The dataset addresses a critical bottleneck in developing AI agents that can automate complex desktop workflows, and it reveals that current models fail roughly 60% of tasks on professional applications.

AI · Bullish · arXiv – CS AI · Mar 5 · 6/10

ToolVQA: A Dataset for Multi-step Reasoning VQA with External Tools

Researchers introduce ToolVQA, a large-scale multimodal dataset with 23K instances designed to improve AI models' ability to use external tools for visual question answering. The dataset features real-world contexts and multi-step reasoning tasks, with fine-tuned 7B models outperforming GPT-3.5-turbo on various benchmarks.

AI · Neutral · arXiv – CS AI · Mar 5 · 7/10

ERDES: A Benchmark Video Dataset for Retinal Detachment and Macular Status Classification in Ocular Ultrasound

Researchers have released ERDES, the first open-access dataset of ocular ultrasound videos for detecting retinal detachment and macular status using machine learning. The dataset addresses a critical gap in automated medical diagnosis by enabling AI models to classify retinal detachment severity, which is essential for determining surgical urgency.

AI · Neutral · arXiv – CS AI · Mar 4 · 6/10 · 2

AI-Generated Music Detection in Broadcast Monitoring

Researchers introduced AI-OpenBMAT, the first dataset designed for detecting AI-generated music in broadcast environments, revealing that existing detection models perform poorly when music appears as short excerpts or is masked by speech. The study found that state-of-the-art detection models' F1-scores dropped below 60% in challenging broadcast scenarios, highlighting significant limitations in current AI music detection technology.
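As a reminder of what the reported metric measures, the F1-score is the harmonic mean of precision and recall. The sketch below uses a hypothetical helper and made-up inputs; it does not reproduce the study's evaluation code or results.

```python
def f1_score(precision: float, recall: float) -> float:
    """Return the harmonic mean of precision and recall (0.0 if both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical detector: half of its flags are correct, and it finds
# three quarters of the AI-generated excerpts.
print(f"F1 = {f1_score(precision=0.5, recall=0.75):.2f}")
```

Because the harmonic mean is pulled toward the lower of the two values, a detector with weak precision or weak recall cannot reach a high F1-score.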

AI · Bullish · arXiv – CS AI · Mar 3 · 7/10 · 3

Robometer: Scaling General-Purpose Robotic Reward Models via Trajectory Comparisons

Researchers introduce Robometer, a new framework for training robot reward models that combines progress tracking with trajectory comparisons to better learn from failed attempts. The system is trained on RBM-1M, a dataset of over one million robot trajectories including failures, and shows improved performance across diverse robotics applications.

AI · Bullish · arXiv – CS AI · Mar 3 · 7/10 · 3

Kiwi-Edit: Versatile Video Editing via Instruction and Reference Guidance

Researchers introduce Kiwi-Edit, a new video editing architecture that combines instruction-based and reference-guided editing for more precise visual control. The team created RefVIE, a large-scale dataset for training, and achieved state-of-the-art results in controllable video editing through a unified approach that addresses limitations of natural language descriptions.

AI · Bullish · Hugging Face Blog · Aug 20 · 7/10 · 7

NVIDIA Releases 6 Million Multi-Lingual Reasoning Dataset

NVIDIA has released a multilingual reasoning dataset of 6 million samples. The release could accelerate work on AI reasoning across multiple languages and benefit the broader AI research community.

AI · Neutral · arXiv – CS AI · 2d ago · 6/10

TelecomTS: A Multi-Modal Observability Dataset for Time Series and Language Analysis

Researchers introduce TelecomTS, a large-scale observability dataset from 5G telecommunications networks designed to advance time series analysis and anomaly detection. The dataset addresses a critical gap in AI research by providing de-anonymized, scale-preserved metrics that reflect real-world system monitoring challenges, while benchmarking reveals that current foundation models struggle with the noisy, high-variance characteristics of enterprise observability data.

AI · Neutral · arXiv – CS AI · 2d ago · 6/10

Obfuscation Rules for Detecting and Detoxifying Korean Toxicity

Researchers introduce KOTOX, the first Korean-language dataset for detecting and neutralizing obfuscated toxic content in language models. The dataset addresses a critical gap by providing paired examples of normal, toxic, and obfuscated text, building on properties of Korean such as agglutination and orthographic variation that make toxicity easy to disguise.

AI · Neutral · arXiv – CS AI · 3d ago · 6/10

From Talking to Singing: A New Challenge for Audio-Visual Deepfake Detection

Researchers have developed a new deepfake detection framework called T-AVFD that addresses a critical gap in audio-visual forgery detection by handling singing scenarios, where traditional cross-modal inconsistency methods fail. The study introduces the SHDF dataset and demonstrates improved detection performance across both talking and singing deepfakes through text-guided pattern learning.

AI · Bullish · arXiv – CS AI · 4d ago · 6/10

Scaling, Benchmarking, and Reasoning of Vision-Language Agents for Mobile GUI Navigation

Researchers introduce HyperTrack, a large-scale dataset of 16,000+ mobile GUI navigation tasks across 650+ Chinese applications, and GUIEvalKit, an open-source benchmarking toolkit for evaluating Vision-Language Models. The study demonstrates that reinforcement-based finetuning substantially outperforms supervised learning for mobile automation tasks, with implications for developing more capable AI agents.

AI · Neutral · arXiv – CS AI · May 12 · 6/10

Align and Shine: Building High-Quality Sentence-Aligned Corpora for Multilingual Text Simplification

Researchers have created a multilingual text simplification corpus by collecting and aligning sentence-level data from comparable corpora across five languages (Catalan, English, French, Italian, and Spanish). The dataset addresses a critical gap in NLP resources for non-English languages and is publicly available for training and evaluating text simplification models.

AI · Neutral · arXiv – CS AI · May 12 · 6/10

NaiAD: Initiate Data-Driven Research for LLM Advertising

Researchers introduce NaiAD, a comprehensive dataset of nearly 59,000 ad-embedded LLM responses designed to optimize advertising within AI systems while maintaining user experience. The framework uses mechanistic analysis to identify four semantic strategies for effective ad integration and employs human-calibrated scoring to enable independent control of user and commercial utility objectives.

AI · Neutral · arXiv – CS AI · May 12 · 6/10

VT-Bench: A Unified Benchmark for Visual-Tabular Multi-Modal Learning

Researchers introduce VT-Bench, the first comprehensive benchmark for visual-tabular multi-modal learning, aggregating 14 datasets with 756K samples across 9 domains. The benchmark evaluates 23 models and reveals significant gaps in current approaches for combining image and tabular data, particularly in high-stakes sectors like healthcare.

AI · Neutral · arXiv – CS AI · May 4 · 6/10

Evaluating Legal Reasoning Traces with Legal Issue Tree Rubrics

Researchers introduce LEGIT, a 24K-instance legal reasoning dataset with hierarchical argument trees that serve as evaluation rubrics for LLM-generated legal reasoning. The study reveals that LLM legal reasoning performance depends critically on both issue coverage and correctness, with RAG and reinforcement learning offering complementary improvements.
