y0news

#data-quality News & Analysis

13 articles tagged with #data-quality. AI-curated summaries with sentiment analysis and key takeaways from 50+ sources.

AI · Neutral · arXiv – CS AI · 3d ago · 7/10
🧠

Drift and selection in LLM text ecosystems

Researchers develop a mathematical framework showing how AI-generated text recursively shapes training corpora through drift and selection mechanisms. The study demonstrates that unfiltered reuse of generated content degrades linguistic diversity, while selective publication based on quality metrics can preserve structural complexity in training data.
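The selection effect described above can be illustrated with a toy simulation (my own sketch, not the paper's framework): text is "regenerated" by resampling from its own corpus each round, with and without a crude quality-style selection step, and lexical diversity is tracked via the type-token ratio.

```python
import random

def regenerate(corpus, n, rng):
    """Stand-in 'model': resample tokens from the current corpus."""
    return [rng.choice(corpus) for _ in range(n)]

def diversity(corpus):
    """Type-token ratio: unique tokens / total tokens."""
    return len(set(corpus)) / len(corpus)

rng = random.Random(0)
seed_corpus = [f"w{i}" for i in range(200)]       # 200 distinct token types

unfiltered = list(seed_corpus)
selected = list(seed_corpus)
for _ in range(10):                               # ten rounds of recursive reuse
    unfiltered = regenerate(unfiltered, 200, rng)        # reuse everything
    draft = regenerate(selected, 400, rng)
    # crude 'selection': keep only tokens that are not over-represented
    draft = [w for w in draft if draft.count(w) <= 2][:200]
    if len(draft) >= 100:
        selected = draft

print(round(diversity(unfiltered), 2), round(diversity(selected), 2))
```

Unfiltered resampling collapses diversity round over round (a coupon-collector effect), while the selection step structurally bounds diversity from below; the paper's drift/selection mechanisms are far richer, but the qualitative gap shows up even here.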

AI · Bullish · Crypto Briefing · 5d ago · 7/10
🧠

Marco Argenti: AI will disrupt legacy software companies by 2026, the importance of data quality for effective AI, and how AI is evolving into a powerful personal assistant | Odd Lots

Marco Argenti predicts that AI will significantly disrupt legacy software companies by 2026, while emphasizing the critical role of data quality in AI effectiveness. The analysis explores how AI is evolving into a sophisticated personal assistant and reshaping developer roles across the industry.

AI · Bullish · arXiv – CS AI · Apr 7 · 7/10
🧠

Can LLMs Learn to Reason Robustly under Noisy Supervision?

Researchers propose Online Label Refinement (OLR) to improve AI reasoning models' robustness under noisy supervision in Reinforcement Learning with Verifiable Rewards. The method addresses the critical problem of training language models when expert-labeled data contains errors, achieving 3-4% performance gains across mathematical reasoning benchmarks.
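The general idea behind online label refinement can be sketched in a toy setting (my own construction, not the paper's OLR, which targets RL with verifiable rewards): keep a soft label per example and gradually blend it toward the model's current prediction, so consistent model evidence can overrule flipped expert labels.

```python
import math
import random

def predict(x, w):
    """Logistic prediction for a 1-D feature."""
    return 1.0 / (1.0 + math.exp(-w * x))

rng = random.Random(0)
# Ground truth: label = 1 iff x > 0; 20% of supervision labels are flipped.
data = []
for _ in range(500):
    x = rng.uniform(-1, 1)
    y = 1.0 if x > 0 else 0.0
    if rng.random() < 0.2:
        y = 1.0 - y
    data.append([x, y, y])            # [feature, noisy label, refined label]

w, lr, alpha = 0.0, 0.5, 0.1          # alpha: refinement rate (assumed value)
for _ in range(20):
    for ex in data:
        x, _, y_soft = ex
        p = predict(x, w)
        w += lr * (y_soft - p) * x                 # SGD step on the soft label
        ex[2] = (1 - alpha) * y_soft + alpha * p   # refine the label online

noisy_acc = sum((ex[1] > 0.5) == (ex[0] > 0) for ex in data) / len(data)
refined_acc = sum((ex[2] > 0.5) == (ex[0] > 0) for ex in data) / len(data)
print(round(noisy_acc, 2), round(refined_acc, 2))
```

Because 80% of the labels point the model in the right direction, the fitted predictions land on the correct side of the boundary, and the online blending pulls most flipped labels back toward the truth.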

AI · Bullish · arXiv – CS AI · Mar 3 · 7/10
🧠

Rewriting Pre-Training Data Boosts LLM Performance in Math and Code

Researchers released two open-source datasets, SwallowCode and SwallowMath, that significantly improve large language model performance in coding and mathematics through systematic data rewriting rather than filtering. The datasets boost Llama-3.1-8B performance by +17.0 on HumanEval for coding and +12.4 on GSM8K for math tasks.

Crypto · Bullish · Crypto Briefing · 5d ago · 6/10
⛓️

Alex Svanevik: Nansen excels in blockchain data attribution, the importance of quality assurance in labeling, and how data harmonization drives insights | Epicenter

Alex Svanevik of Nansen discusses the company's advanced labeling techniques for blockchain data attribution and the critical role of quality assurance in transforming raw on-chain data into actionable insights for cryptocurrency traders and investors. Svanevik emphasizes how data harmonization and rigorous labeling standards enable market participants to make more informed decisions.

AI · Neutral · arXiv – CS AI · Feb 27 · 6/10
🧠

Correcting Human Labels for Rater Effects in AI Evaluation: An Item Response Theory Approach

Researchers propose using psychometric modeling to correct systematic biases in human evaluations of AI systems, demonstrating how Item Response Theory can separate true AI output quality from rater behavior inconsistencies. The approach was tested on OpenAI's summarization dataset and showed improved reliability in measuring AI model performance.
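A minimal Rasch-style sketch of the idea (assumptions mine; the paper's IRT model may differ): rater r approves item i with probability sigmoid(theta_i − b_r), where theta is latent output quality and b is rater severity. Fitting both jointly separates true quality from harsh versus lenient raters.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

rng = random.Random(1)
n_items, n_raters = 30, 5
true_theta = [rng.gauss(0, 1) for _ in range(n_items)]
true_b = [-1.0, -0.5, 0.0, 0.5, 1.0]            # lenient -> severe raters

# Simulated binary approve/reject ratings
ratings = [[1 if rng.random() < sigmoid(true_theta[i] - true_b[r]) else 0
            for r in range(n_raters)] for i in range(n_items)]

theta = [0.0] * n_items
b = [0.0] * n_raters
for _ in range(500):                            # joint gradient ascent on the log-likelihood
    for i in range(n_items):
        for r in range(n_raters):
            err = ratings[i][r] - sigmoid(theta[i] - b[r])
            theta[i] += 0.05 * err
            b[r] -= 0.05 * err
    m = sum(b) / n_raters                       # anchor mean severity at 0
    b = [x - m for x in b]                      # (shift both so theta - b is unchanged)
    theta = [t - m for t in theta]

print([round(x, 2) for x in b])                 # recovered rater severities
```

With only five raters and thirty items the estimates are noisy, but the extremes separate: the most lenient rater recovers a clearly lower severity than the most severe one.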

AI · Neutral · arXiv – CS AI · Mar 4 · 4/10
🧠

Learning to Pay Attention: Unsupervised Modeling of Attentive and Inattentive Respondents in Survey Data

Researchers developed an unsupervised machine learning framework using autoencoders and probabilistic models to detect inattentive survey respondents without traditional attention checks. The study found that survey structure is more important than model complexity for detection effectiveness, with well-designed instruments enabling reliable identification of low-quality responses.
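A much simpler stand-in for the paper's autoencoder captures the intuition (the item-pairing scheme and thresholds below are illustrative assumptions, not the authors' design): attentive respondents answer near-duplicate item pairs consistently, inattentive ones answer at random, so a per-respondent consistency score separates the groups without explicit attention checks.

```python
import random

rng = random.Random(2)
N_PAIRS = 10                                     # ten pairs of near-duplicate items

def respond(attentive):
    """One respondent's answers to all item pairs (5-point Likert scale)."""
    answers = []
    for _ in range(N_PAIRS):
        a = rng.randint(1, 5)
        # attentive: repeat the answer 90% of the time; inattentive: random
        b = a if attentive and rng.random() >= 0.1 else rng.randint(1, 5)
        answers.append((a, b))
    return answers

def consistency(answers):
    """Fraction of item pairs answered identically."""
    return sum(a == b for a, b in answers) / len(answers)

scores = [(consistency(respond(True)), True) for _ in range(50)] + \
         [(consistency(respond(False)), False) for _ in range(50)]

flagged = [att for s, att in scores if s < 0.6]  # low-consistency respondents
print(len(flagged), sum(1 for att in flagged if not att))
```

This echoes the paper's finding that instrument design matters: the separation comes almost entirely from having redundant, well-designed items, not from model sophistication.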

AI · Neutral · Lil'Log (Lilian Weng) · Feb 5 · 4/10
🧠

Thinking about High-Quality Human Data

The article discusses the critical importance of high-quality human-labeled data for training modern deep learning models, particularly for classification tasks and RLHF labeling used in LLM alignment. Despite the recognized value of quality data, there's a notable preference in the ML community for model development work over data collection and annotation work.

AI · Neutral · arXiv – CS AI · Mar 3 · 4/10
🧠

USE: Uncertainty Structure Estimation for Robust Semi-Supervised Learning

Researchers introduce Uncertainty Structure Estimation (USE), a new preprocessing method for semi-supervised learning that improves model reliability by filtering out low-quality unlabeled data. The approach uses entropy scores and statistical thresholds to identify and remove out-of-distribution samples before training, demonstrating consistent accuracy improvements across imaging and NLP tasks.
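The recipe described above can be sketched as follows (my own toy, not the authors' code): score each unlabeled example by predictive entropy under the current model, then drop everything above a statistical threshold before semi-supervised training.

```python
import math
import random

def entropy(probs):
    """Shannon entropy of a predictive distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

rng = random.Random(3)

def mock_prediction(in_dist):
    """Stand-in model confidence: peaked for in-distribution inputs,
    near-uniform for out-of-distribution ones (a deliberate simplification)."""
    top = rng.uniform(0.80, 0.99) if in_dist else rng.uniform(0.34, 0.55)
    return [top, (1 - top) / 2, (1 - top) / 2]   # 3-class distribution

pool = [(mock_prediction(True), True) for _ in range(80)] + \
       [(mock_prediction(False), False) for _ in range(20)]
scores = [entropy(p) for p, _ in pool]

# Statistical threshold: mean + 1 std of the entropy scores over the pool
mu = sum(scores) / len(scores)
sd = (sum((s - mu) ** 2 for s in scores) / len(scores)) ** 0.5
kept = [in_dist for (_, in_dist), s in zip(pool, scores) if s <= mu + sd]
print(len(kept), all(kept))
```

In this toy, the filter keeps exactly the in-distribution pool; in practice the entropy distributions overlap and the threshold trades recall against contamination.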

AI · Neutral · arXiv – CS AI · Mar 2 · 4/10
🧠

MEDIC: a network for monitoring data quality in collider experiments

Researchers have developed MEDIC, a neural network framework for Data Quality Monitoring (DQM) in particle physics experiments that automatically detects detector anomalies and identifies malfunctioning components. The simulation-driven approach, built on a modified Delphes detector simulation, represents an initial step toward comprehensive ML-based DQM systems for future particle detectors.
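A drastically simplified, non-neural stand-in for the monitoring task conveys the core loop (all numbers and names below are illustrative, not MEDIC's architecture): compare each detector channel's hit occupancy against a healthy reference run and flag channels that deviate by many standard deviations.

```python
import random

rng = random.Random(4)
N_CHANNELS = 100
NOMINAL, SPREAD = 1000.0, 30.0       # assumed per-channel occupancy statistics

reference = [rng.gauss(NOMINAL, SPREAD) for _ in range(N_CHANNELS)]  # healthy run
current = [rng.gauss(NOMINAL, SPREAD) for _ in range(N_CHANNELS)]
current[17] = 50.0                   # dead-ish channel
current[42] = 4000.0                 # noisy channel

def flag_channels(ref, cur, n_sigma=5.0):
    """Flag channels whose occupancy differs from reference by > n_sigma."""
    sigma = SPREAD * 2 ** 0.5        # both runs fluctuate independently
    return [ch for ch, (r, c) in enumerate(zip(ref, cur))
            if abs(c - r) > n_sigma * sigma]

print(flag_channels(reference, current))
```

A learned model like MEDIC replaces the fixed per-channel threshold with a network that can exploit correlations across channels and run conditions; the thresholding sketch only shows where such a system plugs into the data-taking loop.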