y0news
AnalyticsDigestsSourcesTopicsRSSAICrypto

#dataset-quality News & Analysis

2 articles tagged with #dataset-quality. AI-curated summaries with sentiment analysis and key takeaways from 50+ sources.

2 articles
AINeutralarXiv – CS AI · Apr 146/10
🧠

Tracing the Roots: A Multi-Agent Framework for Uncovering Data Lineage in Post-Training LLMs

Researchers introduce a multi-agent framework to map data lineage in large language models, revealing how post-training datasets evolve and interconnect. The analysis uncovers structural redundancy, benchmark contamination propagation, and proposes lineage-aware dataset construction to improve LLM training diversity and quality.

AINeutralarXiv – CS AI · Apr 106/10
🧠

On the Step Length Confounding in LLM Reasoning Data Selection

Researchers identify a critical flaw in naturalness-based data selection methods for large language model reasoning datasets, where algorithms systematically favor longer reasoning steps rather than higher-quality reasoning. The study proposes two corrective methods (ASLEC-DROP and ASLEC-CASL) that successfully mitigate this 'step length confounding' bias across multiple LLM benchmarks.