y0news

#dataset-release News & Analysis

19 articles tagged with #dataset-release. AI-curated summaries with sentiment analysis and key takeaways from 50+ sources.

AI · Bullish · arXiv – CS AI · 1d ago · 7/10

Decouple Searching from Training: Scaling Data Mixing via Model Merging for Large Language Model Pre-training

Researchers propose DeMix, a framework that uses model merging to determine optimal data mixtures for large language model pre-training without expensive repeated training cycles. The approach decouples the search process from training costs, so many data combinations can be evaluated cheaply. The authors also release a 22T-token dataset to support open research.

AI · Bullish · arXiv – CS AI · 5d ago · 7/10

SynthTools: A Framework for Scaling Synthetic Tools for Agent Development

SynthTools introduces an LLM-based pipeline for generating synthetic tool environments at scale, creating a dataset of 73,883 validated tools across 6,800 environments and 79,925 verifiable tasks. The framework demonstrates that agents trained on synthetic tool-use data can transfer capabilities to real APIs, addressing a critical bottleneck in agentic AI system development.

AI · Bullish · arXiv – CS AI · Apr 14 · 7/10

Generative UI: LLMs are Effective UI Generators

Researchers demonstrate that modern LLMs can robustly generate custom user interfaces directly from prompts, moving beyond static markdown outputs. The approach shows emergent capabilities with results comparable to human-crafted designs in 50% of cases, accompanied by the release of PAGEN, a dataset for evaluating generative UI implementations.

AI · Bullish · arXiv – CS AI · Mar 12 · 7/10

IH-Challenge: A Training Dataset to Improve Instruction Hierarchy on Frontier LLMs

OpenAI researchers introduce IH-Challenge, a reinforcement learning dataset designed to improve instruction hierarchy in frontier LLMs. Fine-tuning GPT-5-Mini with this dataset improved robustness by 10% and significantly reduced unsafe behavior while maintaining helpfulness.

🏢 OpenAI · 🏢 Hugging Face · 🧠 GPT-5

AI · Neutral · arXiv – CS AI · Mar 4 · 6/10 · 3

Engineering Reasoning and Instruction (ERI) Benchmark: A Large Taxonomy-driven Dataset for Foundation Models and Agents

Researchers released the ERI benchmark, a comprehensive dataset spanning 9 engineering fields and 55 subdomains to evaluate large language models' engineering capabilities. The benchmark tested 7 LLMs across 57,750 records, revealing a clear three-tier performance structure with frontier models like GPT-5 and Claude Sonnet 4 significantly outperforming mid-tier and smaller models.

AI · Bullish · OpenAI News · May 9 · 7/10 · 6

Language models can explain neurons in language models

Researchers used GPT-4 to automatically generate explanations for how individual neurons behave in large language models and to evaluate the quality of those explanations. They have released a comprehensive dataset containing explanations and quality scores for every neuron in GPT-2, advancing AI interpretability research.

AI · Neutral · arXiv – CS AI · 1d ago · 6/10

FAM-Bench: A Multimodal Benchmark for Condition-Aware Food-as-Medicine Reasoning

Researchers introduce FAM-Bench, a multimodal benchmark dataset containing 2,500 expert-verified instances designed to evaluate AI models' ability to assess food suitability for specific health conditions. The benchmark addresses a gap in existing food AI systems by testing health-aware reasoning through dish suitability assessment and comparative analysis tasks across 13 diet-related conditions.

AI · Neutral · arXiv – CS AI · 1d ago · 6/10

ImmigrationQA: A Source-Grounded Dataset and Small-Model Adaptation for U.S. Immigration Law

Researchers released ImmigrationQA, a source-grounded dataset of 17,058 question-answer pairs covering U.S. immigration law, and fine-tuned a Llama 3.2 3B model using LoRA for legal assistance. The fine-tuned model achieved 27% relative improvement over base models but remains limited for complex legal reasoning, demonstrating both the potential and constraints of small language models in high-stakes legal domains.

🧠 Claude · 🧠 Sonnet · 🧠 Llama

AI · Neutral · arXiv – CS AI · 5d ago · 6/10

Diffusion-Based Ukrainian Handwritten Text Generation with Cross-Domain Style Transfer

Researchers have developed a diffusion-based model for generating handwritten Ukrainian text with style transfer capabilities, addressing a significant gap in non-Latin script generation. By constructing a 126,177-image Ukrainian dataset and retraining DiffusionPen without architectural changes, the model demonstrates that few-shot latent diffusion generalizes beyond Latin scripts to Cyrillic writing systems.

AI · Neutral · arXiv – CS AI · 5d ago · 6/10

SMILE-Next: Teaching Large Language Models to Detect, Classify, and Reason about Laughter

Researchers introduce SMILE-Next, a comprehensive dataset and specialized large language model framework for understanding laughter in real-world contexts. The work combines laughter detection, classification, and reasoning tasks with novel training techniques including laughter-specific self-instruction and a mixture-of-experts architecture to improve multimodal language model performance on this underexplored domain.

AI · Neutral · arXiv – CS AI · 5d ago · 5/10

Mining Multi-Modality Spatio-Temporal Cues for Video Important Person Identification

Researchers introduce the Video Important Person (VIP) identification task and Temporal-VIP dataset to automatically identify key individuals in video scenes while addressing the Temporal Importance Shift phenomenon. The VIP-Net framework achieves 67.3% accuracy, significantly outperforming existing methods (37.5%-53.9%), with applications in automated video editing and intelligent surveillance.

🏢 Hugging Face

AI · Neutral · arXiv – CS AI · 5d ago · 6/10

IPO-Mine: A Toolkit and Dataset for Section-Structured Analysis of Long, Multimodal IPO Documents

Researchers have released IPO-Toolkit and IPO-Dataset, a comprehensive open-source framework and dataset containing over 109,000 IPO filings from 1994-2026 with 76,000+ extracted images. The resource enables large-scale analysis of long, multimodal financial documents and reveals that state-of-the-art AI models often misalign with expert judgments on financial chart interpretation tasks.

AI · Bullish · arXiv – CS AI · 6d ago · 6/10

FAST-GOAL: Fast and Efficient Global-local Object Alignment Learning

Researchers introduce FAST-GOAL, a fine-tuning method that improves CLIP's ability to process lengthy text descriptions through global-local semantic alignment. The approach combines object detection with token-level similarity learning and introduces GLIT100k, a new dataset linking long captions to localized image-text pairs, demonstrating significant performance gains across multiple benchmarks.

AI · Neutral · arXiv – CS AI · 6d ago · 6/10

A Dataset of Robot-Patient and Doctor-Patient Medical Dialogues for Spoken Language Processing Tasks

Researchers introduce MeDial-Speech, a new 111+ hour speech dataset for training medical AI systems to conduct patient consultations across four health conditions. The study benchmarks state-of-the-art LLMs including Claude Sonnet 4, GPT-5 mini, and DeepSeek-V3, revealing that while Claude Sonnet 4 achieves 71-75% accuracy in medical dialogue tasks, all models exhibit significant overconfidence in their probabilistic predictions.

🏢 Hugging Face · 🧠 GPT-5 · 🧠 Claude

AI · Bullish · arXiv – CS AI · 6d ago · 6/10

ParsVoice: A Large-Scale Multi-Speaker Persian Speech Corpus for Text-to-Speech Synthesis

Researchers have released ParsVoice, a 2,200-hour Persian speech dataset with 1.36 million aligned segments from 1,815 speakers, making it 25 times larger than previous Persian TTS resources. The dataset was constructed using an automated pipeline combining ASR, fine-tuned language models, and quality assessment, and validation shows the corpus enables multi-speaker text-to-speech systems competitive with existing solutions.

🏢 Hugging Face

AI · Neutral · arXiv – CS AI · May 11 · 6/10

SAM 3D Animal: Promptable Animal 3D Reconstruction from Images in the Wild

Researchers introduce SAM 3D Animal, a promptable framework for reconstructing multiple animals in 3D from single images, addressing key challenges like occlusion and species variation. The team also releases Herd3D, a new multi-animal dataset with over 5K images, achieving state-of-the-art results across multiple benchmarks.

AI · Neutral · arXiv – CS AI · Apr 15 · 6/10

StableSketcher: Enhancing Diffusion Model for Pixel-based Sketch Generation via Visual Question Answering Feedback

StableSketcher is a novel AI framework that enhances diffusion models for generating pixel-based hand-drawn sketches with improved prompt fidelity. The approach combines fine-tuned variational autoencoders with a reinforcement learning reward function based on visual question answering, alongside a new SketchDUO dataset of instance-level sketches paired with captions and Q&A pairs.

🧠 Stable Diffusion

AI · Neutral · arXiv – CS AI · Mar 17 · 6/10

MALicious INTent Dataset and Inoculating LLMs for Enhanced Disinformation Detection

Researchers released MALINT, the first human-annotated English dataset for detecting disinformation and its malicious intent, developed with expert fact-checkers. The study benchmarked 12 language models and introduced intent-based inoculation techniques that improved zero-shot disinformation detection across six datasets, five LLMs, and seven languages.

🧠 Llama