y0news

#data-selection News & Analysis

19 articles tagged with #data-selection. AI-curated summaries with sentiment analysis and key takeaways from 50+ sources.

AI · Neutral · arXiv – CS AI · 3d ago · 7/10

Unifying Data, Memory, and Compute Efficiency in LLM training: A Survey

A comprehensive survey examines how data efficiency, memory constraints, and compute budgets interact as coupled bottlenecks in LLM training. The research reveals that optimal training strategies are resource-dependent rather than universal, with GPU memory often being the primary limiting factor rather than raw computational power.

AI · Bullish · arXiv – CS AI · Jun 1 · 7/10

PRISM: Self-Pruning Intrinsic Selection Method for Training-Free Multimodal Data Selection

Researchers introduce PRISM, a training-free framework for efficiently selecting visual instruction data for multimodal language models that reduces computational costs to 30% of conventional pipelines while improving performance across multiple benchmarks. The method addresses global semantic drift caused by anisotropic visual feature distributions, enabling more efficient model fine-tuning without sacrificing quality.

AI · Bullish · arXiv – CS AI · May 27 · 7/10

Evaluating Sample Utility for Efficient Data Selection by Mimicking Model Weights

Researchers introduce the Mimic Score, a geometry-based metric for evaluating data quality in large datasets by measuring gradient alignment with pre-trained models. The proposed Grad-Mimic framework enables efficient data selection, reducing training steps for CLIP models by 20.7% and filtering datasets without expensive computations or validation sets.

AI · Bullish · arXiv – CS AI · May 12 · 7/10

Learning Multi-Indicator Weights for Data Selection: A Joint Task-Model Adaptation Framework with Efficient Proxies

Researchers propose a framework for optimizing data selection in large language model instruction tuning by learning task-specific and model-specific weights for multiple quality indicators. Using efficient in-context learning signals on small validation sets, the method achieves comparable performance to full-dataset training with only 30% of samples, revealing important trade-offs between semantic diversity and logical complexity.

🧠 Llama
AI · Bullish · arXiv – CS AI · May 11 · 7/10

Efficient Data Selection for Multimodal Models via Incremental Optimization Utility

Researchers introduce One-Step-Train (OST), a new data selection framework for Large Multimodal Models that uses incremental optimization to identify high-quality training samples. The method reduces computational costs by 43% while outperforming existing approaches like LLM-as-a-Judge, demonstrating significant efficiency gains in multimodal model training.

AI · Bullish · arXiv – CS AI · Apr 10 · 7/10

SPICE: Submodular Penalized Information-Conflict Selection for Efficient Large Language Model Training

Researchers introduce SPICE, a data selection algorithm that identifies and minimizes gradient conflicts between training samples, cutting large language model training data requirements by 90% while maintaining performance. The method combines information-theoretic principles with practical efficiency improvements, enabling effective model tuning on just 10% of typical datasets across multiple benchmarks.

AI · Neutral · arXiv – CS AI · 4d ago · 6/10

Active Learning with Foundation Model Priors: Efficient Learning under Class Imbalance

Researchers propose an active learning framework that combines foundation model priors with smaller models to address class imbalance and label noise in real-world datasets. The method achieves over 50% annotation savings compared to existing active learning baselines while maintaining model performance across image and text domains.

AI · Bullish · arXiv – CS AI · 4d ago · 6/10

HARP: Efficient Data Selection for Finetuning Large Language Models

Researchers introduce HARP (Hierarchical Active Region Pruning), a novel training-efficient method for selecting optimal data when finetuning large language models. The approach reduces computational costs by 7x while maintaining or improving model performance by using hierarchical organization and Bayesian inference to evaluate representative subsets rather than exhaustively training on all data.

AI · Neutral · arXiv – CS AI · Jun 2 · 6/10

Adaptive data selection improves wearable prediction under low baseline performance

Researchers demonstrate that adaptive data selection strategies significantly improve machine learning prediction performance in wearable health systems, but primarily benefit individuals with initially poor baseline performance rather than those already performing well. The findings suggest selective deployment of adaptive sensing based on baseline metrics could optimize resource allocation in health monitoring applications.

AI · Neutral · arXiv – CS AI · Jun 1 · 6/10

LARK: Learnability-Grounded Trajectory Selection for Efficient Reasoning Distillation

LARK introduces a learnability-grounded approach to trajectory selection for reasoning distillation, enabling student models to learn more efficiently from teacher-generated reasoning paths. The method uses a learnability factor to identify trajectories that maximize learning speed while maintaining distributional coverage, outperforming existing heuristic-based selection methods across multiple reasoning tasks.

AI · Neutral · arXiv – CS AI · Jun 1 · 6/10

Unifying and Optimizing Data Values for Selection via Sequential Decision-Making

Researchers propose a new framework that reinterprets data selection as a sequential decision-making problem rooted in dynamic programming, unifying existing methods like Data Shapley while revealing their limitations as myopic approximations. The work introduces a scalable bipartite graph-based approach that preserves submodular structure and demonstrates improvements on machine learning and LLM fine-tuning tasks.

AI · Neutral · arXiv – CS AI · May 29 · 6/10

Tailoring the Curriculum: Student-Centered Reasoning Distillation via Dynamic Data-Model Compatibility

Researchers introduce the Data-Model Compatibility (DMC) metric to evaluate how well training datasets align with student models during reasoning distillation from large language models. The metric jointly assesses data quality, difficulty, and student capability, demonstrating strong correlation with distillation performance and enabling dynamic dataset selection that improves outcomes across multiple models and tasks.

AI · Neutral · arXiv – CS AI · May 29 · 6/10

MIRA: Mid-training Rubric Anchoring for Source-Aware Data Selection

Researchers introduce MIRA, a framework for optimizing data selection during mid-training of large language models by dynamically discovering and applying source-specific evaluation rubrics. The approach achieves comparable performance to full-corpus training while reducing token usage by 50% on code-oriented tasks across 21 diverse data sources.

AI · Neutral · arXiv – CS AI · May 28 · 6/10

IRDS: Interpretable RLVR Data Selection via Verifier-Coupled Sparse Autoencoder Coverage

IRDS introduces a new data selection method for reinforcement learning with verifiable rewards (RLVR) that uses sparse autoencoders to identify interpretable, high-value training instances. The approach achieves significant accuracy improvements on math reasoning benchmarks while reducing computational costs by an order of magnitude compared to existing methods.

🧠 Llama
AI · Neutral · arXiv – CS AI · Apr 20 · 6/10

Dynamic Sampling that Adapts: Self-Aware Iterative Data Persistent Optimization for Mathematical Reasoning

Researchers introduce SAI-DPO, a dynamic data sampling framework that adapts training data selection based on a model's evolving capabilities during training, rather than using static metrics. Tested on mathematical reasoning benchmarks including AIME24 and AMC23, SAI-DPO achieves state-of-the-art performance with significantly less training data, outperforming baselines by nearly 6 points.

AI · Neutral · arXiv – CS AI · Apr 14 · 6/10

Data Selection for Multi-turn Dialogue Instruction Tuning

Researchers propose MDS (Multi-turn Dialogue Selection), a framework for improving instruction-tuned language models by intelligently selecting high-quality multi-turn dialogue data. The method combines global coverage analysis with local structural evaluation to filter noisy datasets, demonstrating superior performance across multiple benchmarks compared to existing selection approaches.

AI · Neutral · arXiv – CS AI · Apr 10 · 6/10

On the Step Length Confounding in LLM Reasoning Data Selection

Researchers identify a critical flaw in naturalness-based data selection methods for large language model reasoning datasets, where algorithms systematically favor longer reasoning steps rather than higher-quality reasoning. The study proposes two corrective methods (ASLEC-DROP and ASLEC-CASL) that successfully mitigate this 'step length confounding' bias across multiple LLM benchmarks.

AI · Bullish · arXiv – CS AI · Mar 11 · 6/10

Does the Question Really Matter? Training-Free Data Selection for Vision-Language SFT

Researchers propose CVS, a training-free method that selects high-quality vision-language training samples requiring genuine cross-modal reasoning. Using only 10-15% of the data, it outperforms full-dataset training while reducing computational costs by up to 44%.

AI · Bullish · arXiv – CS AI · Mar 37/106

Token-level Data Selection for Safe LLM Fine-tuning

Researchers have developed TOSS, a new framework for safely fine-tuning large language models that operates at the token level rather than sample level. The method identifies and removes unsafe tokens while preserving task-specific information, demonstrating superior performance compared to existing sample-level defense methods in maintaining both safety and utility.