How Much Is a Dataset Worth? Scaling Laws, the Vendi Score, and Matrix Spectral Functions
Researchers demonstrate that neural scaling laws and the Vendi Score, two methods for evaluating dataset quality, are both submodular functions that belong to a broader class of matrix spectral functions, which allows them to be optimized in a unified way. By developing efficient secular-equation-based updates, they achieve a 35,000x speedup in computation, making direct optimization feasible on large-scale datasets and revealing that facility location outperforms other objectives for predicting training subset value.
This research addresses a fundamental challenge in machine learning: quantifying dataset value beyond simple size metrics. The work bridges theoretical understanding with practical computation, showing that popular data valuation approaches share mathematical structure through submodularity. This discovery matters because it unifies previously disparate methods under a common framework, enabling researchers to reason about their strengths and limitations more systematically.
The efficiency breakthrough deserves particular attention. By avoiding repeated eigendecompositions during optimization, which cuts the cost of each marginal-gain evaluation by a factor of O(m), the authors enable direct optimization on ImageNet-1K-scale datasets where it was previously infeasible. This acceleration puts sophisticated data valuation techniques within reach of groups outside well-resourced research labs.
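To see why secular-equation updates save work, consider a minimal sketch under an assumed setup, which is not necessarily the authors' exact one: the current matrix has already been diagonalized to diag(d), and adding one example perturbs it by a rank-one term rho * z z^T. The updated eigenvalues are then the roots of the secular equation 1 + rho * sum_i z_i^2 / (d_i - lambda) = 0, one root per interlacing interval, and they can be found without a fresh eigendecomposition. The function name `secular_eigenvalues` and the plain bisection solver are illustrative choices.

```python
import numpy as np

def secular_eigenvalues(d, z, rho=1.0, iters=60):
    """Eigenvalues of diag(d) + rho * z z^T via the secular equation.

    Assumptions (illustrative): d is strictly increasing, every z_i is
    nonzero, and rho > 0. Roots are located by vectorized bisection.
    """
    d = np.asarray(d, dtype=float)
    z2 = np.asarray(z, dtype=float) ** 2
    # Interlacing: root i lies in (d[i], d[i+1]); the last root lies in
    # (d[-1], d[-1] + rho * ||z||^2].
    lo = d.copy()
    hi = np.append(d[1:], d[-1] + rho * z2.sum())
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        # f(lam) = 1 + rho * sum_i z_i^2 / (d_i - lam) is increasing
        # on each interval, so the sign of f(mid) tells us which half to keep.
        f = 1.0 + rho * (z2[None, :] / (d[None, :] - mid[:, None])).sum(axis=1)
        hi = np.where(f > 0, mid, hi)
        lo = np.where(f > 0, lo, mid)
    return 0.5 * (lo + hi)

# Usage: compare against a full eigendecomposition on a small example.
d = np.array([0.5, 1.0, 2.0, 3.5])
z = np.array([0.3, -0.4, 0.5, 0.2])
updated = secular_eigenvalues(d, z, rho=2.0)
reference = np.linalg.eigvalsh(np.diag(d) + 2.0 * np.outer(z, z))
print(np.max(np.abs(updated - reference)))
```

In this sketch each bisection step costs O(m^2) for all m roots together, compared with O(m^3) for recomputing the eigendecomposition, which is consistent with the O(m)-factor saving described above.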
The empirical findings challenge conventional assumptions about data quality. The research reveals that the Vendi Score, despite its theoretical elegance, becomes a poor proxy for downstream performance when pushed to extreme values. More surprisingly, uniformly random subsets concentrate tightly in both appraisal scores and held-out performance, suggesting that random selection is a stronger baseline than previously recognized. Facility location, a coverage-based objective, consistently outperforms the alternatives, which indicates that selecting representative examples remains more practically valuable than maximizing entropy-based diversity.
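For readers unfamiliar with the two objectives being compared, the following is a small sketch using their standard definitions, not the authors' code. The Vendi Score is taken as the exponential of the Shannon entropy of the eigenvalues of K/n, where K is an n-by-n similarity matrix with unit diagonal. Facility location sums, over all points, the similarity to the closest selected point. The function names and the toy similarity kernel are assumptions made for illustration.

```python
import numpy as np

def vendi_score(K):
    """exp(Shannon entropy) of the eigenvalues of K / n (K has unit diagonal)."""
    n = K.shape[0]
    lam = np.linalg.eigvalsh(K / n)
    lam = lam[lam > 1e-12]  # drop numerically zero eigenvalues (0 * log 0 = 0)
    return float(np.exp(-(lam * np.log(lam)).sum()))

def facility_location(sim, subset):
    """Sum over all points of the best similarity to any selected point."""
    return float(sim[:, subset].max(axis=1).sum())

def greedy_facility_location(sim, k):
    """Greedy maximization; assumes nonnegative similarities."""
    n = sim.shape[0]
    chosen = []
    best = np.zeros(n)  # best[i] = current coverage of point i
    for _ in range(k):
        # Marginal gain of adding each candidate column j to the current set.
        gains = np.maximum(sim, best[:, None]).sum(axis=0) - best.sum()
        if chosen:
            gains[chosen] = -np.inf  # never pick the same point twice
        j = int(np.argmax(gains))
        chosen.append(j)
        best = np.maximum(best, sim[:, j])
    return chosen

# Toy data: two tight clusters on a line, with an exponential similarity kernel.
x = np.array([0.0, 0.1, 5.0, 5.1])
sim = np.exp(-np.abs(x[:, None] - x[None, :]))
picked = greedy_facility_location(sim, 2)
print(picked, facility_location(sim, picked), vendi_score(sim))
```

On this toy example the greedy routine picks one point from each cluster, which is the coverage behaviour that distinguishes facility location from a purely spectral diversity score.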
These insights reshape how practitioners should approach dataset curation. The finding that data value still varies across subsets even when size, class balance, and training budget are controlled for indicates that subset selection requires domain-specific optimization rather than rule-of-thumb heuristics. The work suggests future research should focus on understanding why facility location succeeds and how to incorporate domain knowledge into spectral function selection.
- →Neural scaling laws and the Vendi Score are both submodular functions, part of a broader matrix spectral function family that enables unified optimization.
- →Secular-equation-based updates achieve a 35,000x speedup, making large-scale dataset optimization feasible where repeated eigendecomposition was previously prohibitive.
- →Facility location consistently outperforms entropy-based Vendi Score and DPP methods across multiple datasets and experimental conditions.
- →The Vendi Score becomes unreliable as a performance proxy when optimized to extreme values, despite its theoretical elegance.
- →Random subsets concentrate tightly in both appraisal scores and held-out performance, suggesting that random selection is a stronger baseline than previously believed.