Unifying and Optimizing Data Values for Selection via Sequential Decision-Making
Researchers propose a new framework that reinterprets data selection as a sequential decision-making problem rooted in dynamic programming, unifying existing methods like Data Shapley while revealing their limitations as myopic approximations. The work introduces a scalable bipartite graph-based approach that preserves submodular structure and demonstrates improvements on machine learning and LLM fine-tuning tasks.
This research addresses a fundamental gap in machine learning infrastructure by formalizing how data values should inform selection decisions. Rather than treating data valuation and selection as separate problems, the authors establish a principled connection through dynamic programming, providing theoretical justification for why certain valuation methods work and where they fail. This matters because data selection directly impacts model performance and training efficiency, particularly as datasets grow larger and computational costs increase.
The work builds on decades of valuation research in machine learning and game theory, where methods like Shapley values have gained popularity for explaining feature and sample importance. However, applying these static valuations to sequential selection problems introduces theoretical gaps that existing literature hasn't adequately addressed. By reframing selection as a sequential optimization problem, the authors expose how traditional approaches like Data Shapley function as linear approximations that ignore the dynamic nature of selection—explaining empirical suboptimality observed in practice.
For practitioners training large language models and building machine learning pipelines, this research offers both theoretical insights and practical tools. The proposed bipartite graph surrogate enables scalable selection with provable guarantees while maintaining submodular structure, addressing a critical bottleneck in efficient model training. As organizations increasingly recognize that data quality matters as much as quantity, improved selection methods reduce training costs and accelerate experimentation cycles.
Future developments should focus on extending this framework to non-submodular selection scenarios and validating improvements on additional domain-specific benchmarks beyond the tested classical ML and LLM fine-tuning tasks.
- →Data selection reformulated as sequential decision-making reveals existing methods like Data Shapley as myopic linear approximations with theoretical limitations.
- →Proposed bipartite graph-based surrogate enables scalable greedy selection while preserving submodular structure with provable performance guarantees.
- →Framework explains when and why traditional data valuation methods fail under utility curvature, bridging theory-practice gap.
- →Empirical improvements demonstrated on classical ML benchmarks and large-scale LLM fine-tuning tasks compared to existing approaches.
- →Public code availability supports adoption and further development in data selection methodologies.