
#dataset-release News & Analysis

27 articles tagged with #dataset-release. AI-curated summaries with sentiment analysis and key takeaways from 50+ sources.

AI · Bullish · arXiv – CS AI · 4d ago · 7/10

Audio-FLAN: An Instruction-Following Dataset for Unified Audio Understanding and Generation of Speech, Music, and Sound

Researchers introduce Audio-FLAN, a large-scale instruction-tuning dataset with over 100 million instances covering 80 diverse tasks across speech, music, and sound domains. This dataset addresses a critical gap in unified audio-language models by enabling both audio understanding and generation tasks, advancing the integration of audio capabilities into large language models.

🏢 Hugging Face
AI · Bullish · arXiv – CS AI · 5d ago · 7/10

FIGMA: Towards FIne-Grained Music retrievAl

Researchers introduce FIGMA, a new multi-view contrastive learning architecture that significantly improves music retrieval based on fine-grained musical attributes like tempo, key, and chord progression. The work addresses a fundamental limitation in existing CLAP-based models that fail to process detailed musical descriptions, achieving up to 73.3% relative improvement and contributing a new 380K music-caption dataset (FGMCaps) to the field.

AI · Neutral · arXiv – CS AI · Jun 2 · 7/10

ClawHub Security Signals: When VirusTotal, Static Analysis, and SkillSpector Disagree

Researchers released ClawHub Security Signals, a dataset of 67,453 AI agent skills analyzed by three security scanners, revealing significant disagreement among detection methods. Only 0.69% of skills were flagged by all three scanners, indicating that single-scanner verdicts are insufficient for securing AI agent ecosystems and that layered security governance is needed instead.

🏢 Nvidia
AI · Bullish · arXiv – CS AI · Jun 1 · 7/10

Decouple Searching from Training: Scaling Data Mixing via Model Merging for Large Language Model Pre-training

Researchers propose DeMix, a framework that uses model merging to efficiently determine optimal data mixtures for large language model pre-training without expensive repeated training cycles. The approach decouples the search process from training costs, enabling evaluation of multiple data combinations. The authors also release a 22T-token dataset to support open research.

AI · Bullish · arXiv – CS AI · May 28 · 7/10

SynthTools: A Framework for Scaling Synthetic Tools for Agent Development

SynthTools introduces an LLM-based pipeline for generating synthetic tool environments at scale, creating a dataset of 73,883 validated tools across 6,800 environments and 79,925 verifiable tasks. The framework demonstrates that agents trained on synthetic tool-use data can transfer capabilities to real APIs, addressing a critical bottleneck in agentic AI system development.

AI · Bullish · arXiv – CS AI · Apr 14 · 7/10

Generative UI: LLMs are Effective UI Generators

Researchers demonstrate that modern LLMs can robustly generate custom user interfaces directly from prompts, moving beyond static markdown outputs. The approach shows emergent capabilities with results comparable to human-crafted designs in 50% of cases, accompanied by the release of PAGEN, a dataset for evaluating generative UI implementations.

AI · Bullish · arXiv – CS AI · Mar 12 · 7/10

IH-Challenge: A Training Dataset to Improve Instruction Hierarchy on Frontier LLMs

OpenAI researchers introduce IH-Challenge, a reinforcement learning dataset designed to improve instruction hierarchy in frontier LLMs. Fine-tuning GPT-5-Mini with this dataset improved robustness by 10% and significantly reduced unsafe behavior while maintaining helpfulness.

🏢 OpenAI · 🏢 Hugging Face · 🧠 GPT-5
AI · Neutral · arXiv – CS AI · Mar 4 · 6/10 · 3

Engineering Reasoning and Instruction (ERI) Benchmark: A Large Taxonomy-driven Dataset for Foundation Models and Agents

Researchers released the ERI benchmark, a comprehensive dataset spanning 9 engineering fields and 55 subdomains to evaluate large language models' engineering capabilities. The benchmark tested 7 LLMs across 57,750 records, revealing a clear three-tier performance structure with frontier models like GPT-5 and Claude Sonnet 4 significantly outperforming mid-tier and smaller models.

AI · Bullish · OpenAI News · May 9 · 7/10 · 6

Language models can explain neurons in language models

Researchers used GPT-4 to automatically generate explanations for how individual neurons behave in large language models and to evaluate the quality of those explanations. They have released a comprehensive dataset containing explanations and quality scores for every neuron in GPT-2, advancing AI interpretability research.

AI · Neutral · arXiv – CS AI · 3d ago · 5/10

Monte Carlo Pass Search: Using Trajectory Generation for 3D Counterfactual Pass Evaluation in Football

Researchers introduce Monte Carlo Pass Search (MCPS), a novel AI system that evaluates football passes by simulating counterfactual scenarios using trajectory generation and value prediction models. The work combines existing machine learning techniques with a new public Bundesliga dataset featuring 3D ball tracking, enabling distribution-aware analysis of pass execution quality and decision-making.

AI · Neutral · arXiv – CS AI · 4d ago · 6/10

Bidirectional Small-Granularity Search between Code and Text

Researchers introduce a bidirectional search task that retrieves code snippets from text descriptions and text descriptions from code snippets, addressing the gap between scientific publications and their implementations. They present a large dataset with automatically generated training data and manually annotated test sets, along with a modular encoder-based approach that achieves strong in-domain results with promising out-of-domain generalization.

🧠 GPT-4
AI · Bullish · arXiv – CS AI · Jun 5 · 6/10

Personal AI Agent for Camera Roll VQA

Researchers introduce camroll, a dataset and AI agent system designed to answer questions about personal photo libraries by retrieving and analyzing relevant images from users' camera rolls. The camroll-agent uses hierarchical memory and specialized tools to handle long-context visual reasoning across thousands of personalized images, outperforming existing baselines in understanding user-specific visual content.

AI · Neutral · arXiv – CS AI · Jun 5 · 6/10

HomeWorld: A Unified Floorplan-to-Furnished Framework for Generating Controllable, Densely Interactive Whole-Home Scenes

Researchers introduce HomeWorld, a unified framework for generating complete, furnished home scenes from floorplans using hierarchical AI models. The system combines large language models for floorplan generation, image models for furniture layout, and vision-language models for iterative refinement, producing simulation-ready indoor environments with a dataset of 300K real floorplans and 5K fully furnished scenes.

AI · Neutral · arXiv – CS AI · Jun 2 · 6/10

ODTQA-FoRe: An Open-Domain Tabular Question Answering Dataset for Future Data Forecasting and Reasoning

Researchers introduce ODTQA-FoRe, a new dataset, and TimeFore, a framework that enables large language models to perform future-oriented numerical predictions on tabular data using time-series forecasting. The work addresses a critical gap: existing LLM systems excel at historical analysis but struggle with predictive reasoning. The approach is demonstrated on real estate data scenarios.

AI · Neutral · arXiv – CS AI · Jun 1 · 6/10

FAM-Bench: A Multimodal Benchmark for Condition-Aware Food-as-Medicine Reasoning

Researchers introduce FAM-Bench, a multimodal benchmark dataset containing 2,500 expert-verified instances designed to evaluate AI models' ability to assess food suitability for specific health conditions. The benchmark addresses a gap in existing food AI systems by testing health-aware reasoning through dish suitability assessment and comparative analysis tasks across 13 diet-related conditions.

AI · Neutral · arXiv – CS AI · Jun 1 · 6/10

ImmigrationQA: A Source-Grounded Dataset and Small-Model Adaptation for U.S. Immigration Law

Researchers released ImmigrationQA, a source-grounded dataset of 17,058 question-answer pairs covering U.S. immigration law, and fine-tuned a Llama 3.2 3B model using LoRA for legal assistance. The fine-tuned model achieved 27% relative improvement over base models but remains limited for complex legal reasoning, demonstrating both the potential and constraints of small language models in high-stakes legal domains.

🧠 Claude · 🧠 Sonnet · 🧠 Llama
AI · Neutral · arXiv – CS AI · May 28 · 6/10

Diffusion-Based Ukrainian Handwritten Text Generation with Cross-Domain Style Transfer

Researchers have developed a diffusion-based model for generating handwritten Ukrainian text with style transfer capabilities, addressing a significant gap in non-Latin script generation. By constructing a 126,177-image Ukrainian dataset and retraining DiffusionPen without architectural changes, the model demonstrates that few-shot latent diffusion generalizes beyond Latin scripts to Cyrillic writing systems.

AI · Neutral · arXiv – CS AI · May 28 · 6/10

SMILE-Next: Teaching Large Language Models to Detect, Classify, and Reason about Laughter

Researchers introduce SMILE-Next, a comprehensive dataset and specialized large language model framework for understanding laughter in real-world contexts. The work combines laughter detection, classification, and reasoning tasks with novel training techniques including laughter-specific self-instruction and a mixture-of-experts architecture to improve multimodal language model performance on this underexplored domain.

AI · Neutral · arXiv – CS AI · May 28 · 5/10

Mining Multi-Modality Spatio-Temporal Cues for Video Important Person Identification

Researchers introduce the Video Important Person (VIP) identification task and Temporal-VIP dataset to automatically identify key individuals in video scenes while addressing the Temporal Importance Shift phenomenon. The VIP-Net framework achieves 67.3% accuracy, significantly outperforming existing methods (37.5%-53.9%), with applications in automated video editing and intelligent surveillance.

🏢 Hugging Face
AI · Neutral · arXiv – CS AI · May 28 · 6/10

IPO-Mine: A Toolkit and Dataset for Section-Structured Analysis of Long, Multimodal IPO Documents

Researchers have released IPO-Toolkit and IPO-Dataset, a comprehensive open-source framework and dataset containing over 109,000 IPO filings from 1994-2026 with 76,000+ extracted images. The resource enables large-scale analysis of long, multimodal financial documents and reveals that state-of-the-art AI models often misalign with expert judgments on financial chart interpretation tasks.

AI · Bullish · arXiv – CS AI · May 27 · 6/10

FAST-GOAL: Fast and Efficient Global-local Object Alignment Learning

Researchers introduce FAST-GOAL, a fine-tuning method that improves CLIP's ability to process lengthy text descriptions through global-local semantic alignment. The approach combines object detection with token-level similarity learning and introduces GLIT100k, a new dataset linking long captions to localized image-text pairs, demonstrating significant performance gains across multiple benchmarks.

AI · Neutral · arXiv – CS AI · May 27 · 6/10

A Dataset of Robot-Patient and Doctor-Patient Medical Dialogues for Spoken Language Processing Tasks

Researchers introduce MeDial-Speech, a new 111+ hour speech dataset for training medical AI systems to conduct patient consultations across four health conditions. The study benchmarks state-of-the-art LLMs including Claude Sonnet 4, GPT-5 mini, and DeepSeek-V3, revealing that while Claude Sonnet 4 achieves 71-75% accuracy in medical dialogue tasks, all models exhibit significant overconfidence in their probabilistic predictions.

🏢 Hugging Face · 🧠 GPT-5 · 🧠 Claude
AI · Bullish · arXiv – CS AI · May 27 · 6/10

ParsVoice: A Large-Scale Multi-Speaker Persian Speech Corpus for Text-to-Speech Synthesis

Researchers have released ParsVoice, a 2,200-hour Persian speech dataset with 1.36 million aligned segments from 1,815 speakers, making it 25 times larger than previous Persian TTS resources. The dataset was constructed using an automated pipeline combining ASR, fine-tuned language models, and quality assessment, and validation shows the corpus enables multi-speaker text-to-speech systems competitive with existing solutions.

🏢 Hugging Face
AI · Neutral · arXiv – CS AI · May 11 · 6/10

SAM 3D Animal: Promptable Animal 3D Reconstruction from Images in the Wild

Researchers introduce SAM 3D Animal, a promptable framework for reconstructing multiple animals in 3D from single images, addressing key challenges like occlusion and species variation. The team also releases Herd3D, a new multi-animal dataset with over 5K images, and reports state-of-the-art results across multiple benchmarks.
