y0news

#data-selection News & Analysis

5 articles tagged with #data-selection. AI-curated summaries with sentiment analysis and key takeaways from 50+ sources.

AI · Bullish · arXiv – CS AI · Apr 10 · 7/10
🧠

SPICE: Submodular Penalized Information-Conflict Selection for Efficient Large Language Model Training

Researchers introduce SPICE, a data selection algorithm that reduces large language model training data requirements by 90% while maintaining performance by identifying and minimizing gradient conflicts between training samples. The method combines information-theoretic principles with practical efficiency improvements, enabling effective model tuning on just 10% of typical datasets across multiple benchmarks.
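To make the idea concrete, below is a minimal greedy-selection sketch in the spirit of the summary: each candidate's marginal value is offset by a penalty for gradient conflict with samples already chosen. This is illustrative only; the function name, the gradient-norm "information" proxy, and the `conflict_weight` knob are assumptions, not the paper's actual objective.

```python
import numpy as np

def select_subset(per_sample_grads, k, conflict_weight=1.0):
    """Illustrative greedy selection: reward informative samples, penalize
    gradient conflict with the already-selected set (not SPICE's exact objective)."""
    grads = np.stack([g / (np.linalg.norm(g) + 1e-8) for g in per_sample_grads])
    selected, remaining = [], set(range(len(per_sample_grads)))
    while len(selected) < k and remaining:
        best, best_score = None, -np.inf
        for i in remaining:
            info = np.linalg.norm(per_sample_grads[i])        # stand-in for marginal information
            conflict = 0.0
            if selected:
                sims = grads[selected] @ grads[i]             # cosine similarity to chosen samples
                conflict = np.clip(-sims, 0, None).sum()      # count only conflicting directions
            score = info - conflict_weight * conflict
            if score > best_score:
                best, best_score = i, score
        selected.append(best)
        remaining.remove(best)
    return selected
```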

AI · Neutral · arXiv – CS AI · 6d ago · 6/10
🧠

Data Selection for Multi-turn Dialogue Instruction Tuning

Researchers propose MDS (Multi-turn Dialogue Selection), a framework for improving instruction-tuned language models by intelligently selecting high-quality multi-turn dialogue data. The method combines global coverage analysis with local structural evaluation to filter noisy datasets, demonstrating superior performance across multiple benchmarks compared to existing selection approaches.
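As a rough sketch of the "global coverage plus local structure" idea, the snippet below scores each dialogue by its embedding novelty relative to what has already been kept, plus a simple structural sanity check on its turns. The `embed` callable and the 0.5 acceptance threshold are assumptions for illustration; the MDS paper's actual criteria are not reproduced here.

```python
from typing import List, Dict
import numpy as np

def score_dialogues(dialogues: List[List[Dict[str, str]]], embed, coverage_weight=0.5):
    """Illustrative filter: global coverage (embedding novelty) combined with a
    local structural check on each multi-turn dialogue."""
    kept, covered = [], []
    for idx, dlg in enumerate(dialogues):
        emb = embed(" ".join(turn["content"] for turn in dlg))
        emb = emb / (np.linalg.norm(emb) + 1e-8)
        # global coverage: how novel is this dialogue relative to what we already kept?
        novelty = 1.0 if not covered else 1.0 - max(float(emb @ c) for c in covered)
        # local structure: require at least two non-empty turns
        structure = 1.0 if len(dlg) >= 2 and all(t["content"].strip() for t in dlg) else 0.0
        if coverage_weight * novelty + (1 - coverage_weight) * structure > 0.5:
            kept.append(idx)
            covered.append(emb)
    return kept
```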

AI · Neutral · arXiv – CS AI · Apr 10 · 6/10
🧠

On the Step Length Confounding in LLM Reasoning Data Selection

Researchers identify a critical flaw in naturalness-based data selection methods for large language model reasoning datasets, where algorithms systematically favor longer reasoning steps rather than higher-quality reasoning. The study proposes two corrective methods (ASLEC-DROP and ASLEC-CASL) that successfully mitigate this 'step length confounding' bias across multiple LLM benchmarks.
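A generic way to remove that kind of bias is to regress the selection score on step length and keep only the residual, so length no longer drives the ranking. The sketch below shows that length-debiasing idea; it is not a reconstruction of ASLEC-DROP or ASLEC-CASL, whose details are not given in the summary.

```python
import numpy as np

def debias_scores(naturalness_scores, step_lengths):
    """Illustrative length debiasing: fit score ~ a*length + b and return the
    residual, so longer reasoning steps are not automatically preferred."""
    x = np.asarray(step_lengths, dtype=float)
    y = np.asarray(naturalness_scores, dtype=float)
    slope, intercept = np.polyfit(x, y, deg=1)   # linear fit of score on step length
    return y - (slope * x + intercept)            # length-corrected selection score
```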

AI · Bullish · arXiv – CS AI · Mar 11 · 6/10
🧠

Does the Question Really Matter? Training-Free Data Selection for Vision-Language SFT

Researchers propose CVS, a training-free method for selecting high-quality vision-language training data that requires genuine cross-modal reasoning. The method achieves better performance using only 10-15% of data compared to full dataset training, while reducing computational costs by up to 44%.
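One training-free way to test whether a sample "requires genuine cross-modal reasoning" is to measure how much the image changes the model's likelihood of the reference answer; samples answerable from text alone score near zero. The sketch below assumes a hypothetical `model.score(answer, question, image=...)` interface purely for illustration; it is not the CVS paper's API or exact criterion.

```python
import torch

@torch.no_grad()
def cross_modal_gain(model, question, answer, image):
    """Illustrative training-free score: log-likelihood of the answer with the
    image minus without it (assumes a hypothetical model.score interface)."""
    ll_with_image = model.score(answer, question, image=image)
    ll_text_only = model.score(answer, question, image=None)
    return float(ll_with_image - ll_text_only)
```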

AI · Bullish · arXiv – CS AI · Mar 3 · 7/10
🧠

Token-level Data Selection for Safe LLM Fine-tuning

Researchers have developed TOSS, a new framework for safely fine-tuning large language models that operates at the token level rather than sample level. The method identifies and removes unsafe tokens while preserving task-specific information, demonstrating superior performance compared to existing sample-level defense methods in maintaining both safety and utility.
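The token-level idea can be illustrated with a masked fine-tuning loss: tokens flagged as unsafe are excluded from the cross-entropy objective instead of discarding the whole sample, so the remaining task-relevant tokens still contribute gradient. This is a generic sketch under that assumption, not TOSS's actual scoring or removal rule.

```python
import torch
import torch.nn.functional as F

def token_masked_loss(logits, labels, unsafe_token_mask):
    """Illustrative token-level loss.
    logits: (batch, seq, vocab); labels: (batch, seq); unsafe_token_mask: (batch, seq) bool.
    Unsafe tokens are dropped from the objective; the rest train normally."""
    per_token = F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),
        labels.reshape(-1),
        reduction="none",
    ).reshape(labels.shape)
    keep = (~unsafe_token_mask).float()
    return (per_token * keep).sum() / keep.sum().clamp(min=1.0)
```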