#dataset-quality News & Analysis

7 articles tagged with #dataset-quality. AI-curated summaries with sentiment analysis and key takeaways from 50+ sources.

AI · Bullish · arXiv – CS AI · Jun 23 · 7/10

Large Language Model-Assisted Cleaning of Report-Derived Labels in a Large-Scale Chest CT Dataset

Researchers used GPT-5.4 to identify labeling errors in CT-RATE, a large-scale chest CT dataset containing 24,434 radiology reports and 439,812 label instances. The LLM-assisted cleaning achieved 96.4% agreement with existing labels. Radiologists validated that the model correctly identified discordances in 74-92% of flagged cases, which points to potential for scalable dataset quality improvement.

🏢 Microsoft · 🧠 GPT-5
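
As a rough illustration of the kind of agreement check described in the summary above, the sketch below compares existing labels against LLM-proposed labels and collects the disagreements for human review. It is not the paper's pipeline: the function name, the `(report_id, finding)` key layout, and the toy data are all assumptions made for this example.

```python
from typing import Dict, List, Tuple

# Hypothetical key: (report_id, finding_name). Not taken from CT-RATE itself.
LabelKey = Tuple[str, str]


def agreement_and_discordances(
    existing: Dict[LabelKey, int],
    proposed: Dict[LabelKey, int],
) -> Tuple[float, List[LabelKey]]:
    """Return the agreement rate over shared label instances and the keys that disagree."""
    shared = sorted(existing.keys() & proposed.keys())
    if not shared:
        raise ValueError("no label instances in common")
    discordant = [key for key in shared if existing[key] != proposed[key]]
    rate = 1.0 - len(discordant) / len(shared)
    return rate, discordant


# Toy data (made up): four label instances, one disagreement.
existing = {("r1", "nodule"): 1, ("r1", "effusion"): 0,
            ("r2", "nodule"): 0, ("r2", "effusion"): 1}
proposed = {("r1", "nodule"): 1, ("r1", "effusion"): 0,
            ("r2", "nodule"): 1, ("r2", "effusion"): 1}

rate, flagged = agreement_and_discordances(existing, proposed)
print(f"agreement={rate:.2f}, flagged for review={flagged}")
```

In a setup like this, the flagged keys would be the cases passed to radiologists for validation.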

AI · Bullish · arXiv – CS AI · Jun 1 · 7/10

Joint angle based learning to refine kinematic human pose estimation

Researchers propose a joint angle-based learning method to refine human pose estimation (HPE) by leveraging kinematic constraints and Fourier series approximation, addressing keypoint recognition errors and trajectory fluctuations. The approach demonstrates superior performance in challenging motion scenarios like figure skating and breaking, offering potential applications across sports analysis, healthcare, and motion capture industries.

AI · Bullish · arXiv – CS AI · Jun 23 · 6/10

Data Selection Through Iterative Self-Filtering for Vision-Language Settings

Researchers propose a Self-Filtering method that trains CLIP vision-language models on dynamically evolving datasets by iteratively balancing clean samples with diverse data. This bootstrapped approach improves model performance without requiring additional data or pre-trained models, addressing the challenge of training on large-scale noisy datasets.

AI · Neutral · arXiv – CS AI · Jun 19 · 5/10

PrefSQA: Pairwise Preference Prediction for Speech Quality Assessment and the Critical Role of High Quality Datasets

Researchers introduce PrefSQA, a machine learning method that predicts speech quality through pairwise preference comparisons rather than traditional mean opinion scores (MOS). The approach incorporates uncertainty-aware logits and attention mechanisms, demonstrating that preference-based labeling produces cleaner, more reliable datasets than scalar MOS ratings, though improvements vary significantly based on dataset quality.
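
To make the contrast between pairwise preferences and scalar MOS concrete, here is a minimal sketch of a generic preference model in the Bradley-Terry style: the probability that clip A is preferred over clip B is a sigmoid of the difference between their scores. This is an assumption for illustration only. It does not reproduce PrefSQA's uncertainty-aware logits or attention mechanisms, and the function names are hypothetical.

```python
import math


def preference_probability(score_a: float, score_b: float) -> float:
    """Probability that A is preferred over B under a sigmoid of the score difference."""
    return 1.0 / (1.0 + math.exp(-(score_a - score_b)))


def pairwise_loss(score_a: float, score_b: float, a_preferred: bool) -> float:
    """Negative log-likelihood of the observed preference label."""
    p = preference_probability(score_a, score_b)
    return -math.log(p if a_preferred else 1.0 - p)


# Toy scores (made up): A is scored higher than B.
print(preference_probability(2.0, 1.0))
print(pairwise_loss(2.0, 1.0, a_preferred=True))
print(pairwise_loss(2.0, 1.0, a_preferred=False))
```

Training on labels of this form only requires listeners to say which of two clips sounds better, rather than to place each clip on an absolute scale.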

AI · Neutral · arXiv – CS AI · Jun 12 · 6/10

Constructing Evaluation Datasets for Procedural Reasoning: Balancing Naturalness, Grounding, and Multi-Hop Coverage

Researchers present a framework for evaluating procedural reasoning datasets in AI-supported learning systems by comparing three question-generation strategies based on Task-Method-Knowledge (TMK) models. The study demonstrates that strict TMK generation produces the most grounded and usable datasets (96.5% grounded), while transcript-based approaches sacrifice representational alignment for naturalness, highlighting the trade-off between learner-like phrasing and formal grounding in evaluation dataset construction.

AI · Neutral · arXiv – CS AI · Apr 14 · 6/10

Tracing the Roots: A Multi-Agent Framework for Uncovering Data Lineage in Post-Training LLMs

Researchers introduce a multi-agent framework to map data lineage in large language models, revealing how post-training datasets evolve and interconnect. The analysis uncovers structural redundancy and the propagation of benchmark contamination, and the authors propose lineage-aware dataset construction to improve LLM training diversity and quality.

AI · Neutral · arXiv – CS AI · Apr 10 · 6/10

On the Step Length Confounding in LLM Reasoning Data Selection

Researchers identify a critical flaw in naturalness-based data selection methods for large language model reasoning datasets, where algorithms systematically favor longer reasoning steps rather than higher-quality reasoning. The study proposes two corrective methods (ASLEC-DROP and ASLEC-CASL) that successfully mitigate this 'step length confounding' bias across multiple LLM benchmarks.
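
One generic way to check for a length confound like the one described above is to measure how strongly a selection score correlates with step length, then remove the linear length component from the score. The sketch below does this with plain Python. It is not ASLEC-DROP or ASLEC-CASL, whose details are not given in this summary; the function names and the toy numbers are assumptions.

```python
from statistics import mean
from typing import List


def pearson(xs: List[float], ys: List[float]) -> float:
    """Pearson correlation coefficient between two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def residualize(scores: List[float], lengths: List[float]) -> List[float]:
    """Subtract the least-squares linear effect of length from each score."""
    ml, ms = mean(lengths), mean(scores)
    slope = (sum((l - ml) * (s - ms) for l, s in zip(lengths, scores))
             / sum((l - ml) ** 2 for l in lengths))
    return [s - ms - slope * (l - ml) for s, l in zip(scores, lengths)]


# Toy data (made up): mean step length in tokens and a naturalness-style score.
lengths = [10.0, 20.0, 30.0, 40.0, 50.0]
scores = [-2.0, -1.6, -1.5, -1.0, -0.9]

adjusted = residualize(scores, lengths)
print(f"raw correlation with length: {pearson(scores, lengths):.3f}")
print(f"adjusted correlation with length: {pearson(adjusted, lengths):.3f}")
```

A high raw correlation suggests the score is tracking length. Ranking by the adjusted score instead selects samples that score well relative to others of similar step length.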