#dataset-curation News & Analysis

6 articles tagged with #dataset-curation. AI-curated summaries with sentiment analysis and key takeaways from 50+ sources.


AI · Bearish · arXiv – CS AI · May 4 · 7/10

The Algorithmic Gaze of Image Quality Assessment: An Audit and Trace Ethnography of the LAION-Aesthetics Predictor

Researchers audited the LAION-Aesthetics Predictor (LAP), a model widely used to filter training datasets for visual generative AI systems such as Stable Diffusion. The audit finds that LAP is systematically biased toward images of women, disproportionately filters out images of men and LGBTQ+ people, and reinforces Western artistic preferences, raising critical questions about whose aesthetic values shape AI-generated imagery.

🧠 Stable Diffusion

AI · Bullish · arXiv – CS AI · Mar 4 · 7/10 · 3

Skywork-Reward-V2: Scaling Preference Data Curation via Human-AI Synergy

Researchers introduce Skywork-Reward-V2, a suite of reward models trained on SynPref-40M, a dataset of 40 million preference pairs created through human-AI collaboration. The models achieve state-of-the-art performance across seven major benchmarks by combining the quality of human annotation with the scalability of AI-assisted curation.

AI · Neutral · arXiv – CS AI · Jun 9 · 6/10

SlideCheck: Guiding Self-Supervised Pretraining of Pathology Foundation Models via Dataset Distributions

Researchers introduce SlideCheck, a data guidance tool for pathology foundation models that uses features from a frozen model to score and curate pretraining datasets. The system assigns abnormality and malignancy scores to help organize and audit patches derived from whole-slide images (WSIs), and the study shows that controlling dataset composition significantly influences downstream self-supervised learning outcomes.

AI · Neutral · arXiv – CS AI · Jun 9 · 6/10

AgriGov: A Structured Multilingual Dataset Curation for Indian Government Schemes for Farmers

AgriGov is a curated trilingual dataset (English-Hindi-Marathi) containing 8,000 parallel sentence pairs focused on Indian agricultural government schemes and farmer welfare programs. It was built through automated data collection, machine translation, and human post-editing, and provides domain-specific resources for machine translation, question answering, and information retrieval in farmer-facing applications.

AI · Neutral · arXiv – CS AI · Mar 3 · 6/10 · 7

Challenges in Enabling Private Data Valuation

Researchers identify fundamental conflicts between data privacy and data valuation methods used in AI training. The study shows that differential privacy requirements often destroy the fine-grained distinctions needed for effective data valuation, particularly for rare or influential examples.

AI · Bullish · OpenAI News · Jun 10 · 6/10 · 5

Improving language model behavior by training on a curated dataset

Researchers show that fine-tuning on a small, curated dataset can improve a language model's adherence to specific behavioral values. The approach offers a more efficient way to align AI models with desired behavior without requiring massive training resources.