
Data filtering methods for training language models

arXiv – CS AI | Egor Shevchenko, Elena Bruches | May 29, 2026 at 04:00 AM

🤖AI Summary

Researchers compared two automatic label error detection methods, Confident Learning and Dataset Cartography, for filtering noisy training data in Russian text classification tasks. The study finds that filtering effectiveness depends heavily on dataset characteristics: significant improvements appear only on small, noisy datasets, while larger corpora with low noise show no benefit from filtering.

Analysis

This research addresses a fundamental challenge in machine learning: training data quality directly affects model performance. Label errors (incorrect annotations, which occur even in benchmark datasets) introduce noise that degrades model generalization. By comparing Confident Learning and Dataset Cartography across three Russian-language corpora with different characteristics, the study offers practical guidance on when automated filtering is worth applying.
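To make the first method concrete, here is a minimal sketch of the core idea behind Confident Learning: compute a per-class confidence threshold from out-of-sample predicted probabilities, then flag examples that the model confidently assigns to a class other than their given label. This is a simplified illustration, not the configuration used in the study; the function names and the toy probabilities are assumptions made up for this example.

```python
from typing import List, Sequence


def class_thresholds(probs: Sequence[Sequence[float]],
                     labels: Sequence[int],
                     n_classes: int) -> List[float]:
    """Threshold for class j: mean predicted probability of class j
    over the examples whose given label is j."""
    sums = [0.0] * n_classes
    counts = [0] * n_classes
    for p, y in zip(probs, labels):
        sums[y] += p[y]
        counts[y] += 1
    # A class with no labeled examples can never count as "confident".
    return [sums[j] / counts[j] if counts[j] else float("inf")
            for j in range(n_classes)]


def flag_label_issues(probs: Sequence[Sequence[float]],
                      labels: Sequence[int],
                      n_classes: int) -> List[int]:
    """Return indices whose most likely *confident* class differs
    from the given label (simplified off-diagonal rule)."""
    thresholds = class_thresholds(probs, labels, n_classes)
    flagged = []
    for idx, (p, y) in enumerate(zip(probs, labels)):
        # Classes whose probability reaches that class's threshold.
        confident = [j for j in range(n_classes) if p[j] >= thresholds[j]]
        if not confident:
            continue  # the model is not confident about any class
        likely = max(confident, key=lambda j: p[j])
        if likely != y:
            flagged.append(idx)
    return flagged


# Toy data (hypothetical): probabilities should come from held-out
# predictions, e.g. cross-validation, never from the training fit.
probs = [[0.90, 0.10], [0.80, 0.20], [0.25, 0.75],
         [0.10, 0.90], [0.15, 0.85], [0.30, 0.70]]
labels = [0, 0, 1, 1, 0, 1]
suspects = flag_label_issues(probs, labels, n_classes=2)
```

On this toy input only index 4 is flagged: it is labeled 0, yet the model gives class 1 a probability above that class's threshold. In practice a maintained implementation would likely be preferable to hand-rolled code like this.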

The findings show that neither method works universally. On ru_emotion_e-culture, with 49,123 examples, filtering provided minimal gains because noise levels were already low. On smaller datasets such as TERRa, with only 2,337 examples, Confident Learning achieved significant F1-macro improvements by identifying and removing mislabeled examples. Dataset Cartography removed examples more conservatively across all corpora, so the two methods present different risk profiles for practitioners choosing between them.
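For comparison, the second method can be sketched as follows. Dataset Cartography characterizes each example by its training dynamics: the mean probability the model assigns to the gold label across epochs (confidence) and the spread of that probability (variability). Examples with low confidence and low variability fall in the "hard-to-learn" region, which is commonly associated with label errors. The cutoffs and toy values below are illustrative assumptions, not the settings from the study.

```python
from statistics import mean, pstdev
from typing import Dict, List, Sequence


def cartography_stats(gold_probs: Sequence[Sequence[float]]) -> List[Dict[str, float]]:
    """gold_probs[i][e] is the probability the model assigned to the
    gold label of example i after epoch e."""
    return [{"confidence": mean(ps), "variability": pstdev(ps)}
            for ps in gold_probs]


def hard_to_learn(gold_probs: Sequence[Sequence[float]],
                  max_confidence: float = 0.3,
                  max_variability: float = 0.15) -> List[int]:
    """Indices of consistently low-confidence examples.
    Both cutoffs are hypothetical and would need tuning per dataset."""
    stats = cartography_stats(gold_probs)
    return [i for i, s in enumerate(stats)
            if s["confidence"] < max_confidence
            and s["variability"] < max_variability]


# Toy training dynamics over four epochs (made-up values).
dynamics = [
    [0.60, 0.80, 0.90, 0.95],  # easy to learn: high confidence
    [0.20, 0.70, 0.30, 0.80],  # ambiguous: high variability
    [0.10, 0.05, 0.10, 0.15],  # hard to learn: possible label error
]
candidates = hard_to_learn(dynamics)
```

Because an example must stay low-confidence across every epoch to be flagged, a rule of this shape tends to remove fewer examples than a single-pass probability threshold would.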

The control experiments, in which an equal number of randomly chosen examples was removed, establish that the improvements from both methods come from detecting mislabeled examples rather than from simply shrinking the dataset. This supports the claim that both filtering approaches can distinguish harmful noise from valuable training examples.
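A random-removal control of this kind could be set up roughly as follows. The sketch builds two index sets of identical size, one after dropping the flagged examples and one after dropping the same number of random examples, so that any difference in downstream scores cannot be attributed to dataset size alone. The helper names and the seed are assumptions; model training and F1-macro evaluation are deliberately left out.

```python
import random
from typing import Iterable, List


def keep_after_filtering(n_examples: int, flagged: Iterable[int]) -> List[int]:
    """Indices that remain after dropping the flagged examples."""
    drop = set(flagged)
    return [i for i in range(n_examples) if i not in drop]


def keep_after_random_removal(n_examples: int, n_remove: int, seed: int = 0) -> List[int]:
    """Indices that remain after dropping n_remove random examples.
    A fixed seed keeps the control reproducible."""
    rng = random.Random(seed)
    drop = set(rng.sample(range(n_examples), n_remove))
    return [i for i in range(n_examples) if i not in drop]


# Hypothetical setup: 10 examples, 3 of them flagged by a detector.
flagged = [1, 4, 7]
filtered_idx = keep_after_filtering(10, flagged)
control_idx = keep_after_random_removal(10, len(flagged), seed=0)
# Train one model per index set and compare their F1-macro scores.
```

Repeating the control with several seeds and averaging the scores would give a steadier baseline than a single random draw.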

These results have implications for AI development teams working with limited budgets. For small datasets common in specialized domains or low-resource languages, automated label error detection becomes a practical tool for improving model quality without expensive manual annotation review. However, teams managing large, carefully curated datasets may see minimal returns from such filtering, suggesting resources are better allocated elsewhere in the development pipeline.

Key Takeaways

→ Confident Learning delivers significant F1-macro improvements on small, high-noise datasets while remaining ineffective on large, low-noise corpora
→ Dataset Cartography filters more conservatively, removing fewer examples across all corpora
→ Filtering effectiveness depends strongly on dataset characteristics, including size, noise level, and domain complexity
→ Both methods outperform removing the same number of random examples, which indicates that they detect genuine label noise rather than benefiting from data reduction alone
→ Practitioners with small datasets should prioritize label error detection, while teams with large datasets may benefit more from other optimization approaches