AI · Neutral · arXiv – CS AI · 14h ago · 6/10
Data filtering methods for training language models
Researchers compared two automatic label-error detection methods, Confident Learning and Dataset Cartography, for filtering noisy training data in Russian text classification tasks. The study finds that filtering effectiveness depends heavily on dataset characteristics: significant improvements appeared only on small, noisy datasets, while larger corpora with low noise showed no benefit from filtering.
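
To make the filtering idea concrete, here is a minimal sketch of a Confident Learning style check. It is a simplified illustration, not the study's implementation. The function name `flag_label_issues` and the toy data are hypothetical, and the sketch assumes out-of-sample predicted probabilities are already available and that every class has at least one labeled example.

```python
import numpy as np

def flag_label_issues(labels, pred_probs):
    """Return a boolean mask marking examples whose given label looks suspicious.

    Simplified Confident Learning style rule (an assumption of this sketch):
    a class is a confident candidate for an example when its predicted
    probability reaches that class's threshold; the example is flagged when
    the most probable confident candidate differs from the given label.
    """
    labels = np.asarray(labels)
    pred_probs = np.asarray(pred_probs, dtype=float)
    n_classes = pred_probs.shape[1]

    # Per-class threshold: mean predicted probability of class k
    # over the examples that carry label k.
    thresholds = np.array(
        [pred_probs[labels == k, k].mean() for k in range(n_classes)]
    )

    # Which classes are confident candidates for each example.
    confident = pred_probs >= thresholds

    # Most probable class among the confident candidates (if any).
    masked = np.where(confident, pred_probs, -1.0)
    guess = masked.argmax(axis=1)
    has_guess = confident.any(axis=1)

    return has_guess & (guess != labels)


# Toy data: the third example is labeled 0 but predicted strongly as class 1.
toy_labels = [0, 0, 0, 1, 1, 1]
toy_probs = [
    [0.9, 0.1],
    [0.8, 0.2],
    [0.1, 0.9],
    [0.2, 0.8],
    [0.1, 0.9],
    [0.3, 0.7],
]
mask = flag_label_issues(toy_labels, toy_probs)
print(np.flatnonzero(mask))  # indices of flagged examples; here only index 2
```

A filtering step would drop or relabel the flagged rows before retraining. Consistent with the summary above, whether that helps is likely to depend on how small and how noisy the dataset is.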